Re: [ClusterLabs] Corosync crash

2019-05-07 Thread Valentin Vidic
On Tue, May 07, 2019 at 09:59:03AM +0300, Klecho wrote:
> During the weekend my corosync daemon suddenly died without anything in the
> logs, except this:
> 
> May  5 20:39:16 ZZZ kernel: [1605277.136049] traps: corosync[2811] trap
> invalid opcode ip:5635c376f2eb sp:7ffc3e109950 error:0 in
> corosync[5635c3745000+47000]
> 
> The version is corosync 2.4.4-3 amd64 standard Debian stretch

Don't recall seeing crashes like that with the stretch release; any
chance you can run a memtest on that machine?
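
If rebooting into memtest86+ is not convenient, a userspace pass can at
least catch gross problems (the size and loop count below are arbitrary
examples):

  memtester 2048M 3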

-- 
Valentin


Re: [ClusterLabs] Antw: Re: Issue with DB2 HADR cluster

2019-04-03 Thread Valentin Vidic
On Wed, Apr 03, 2019 at 09:13:58AM +0200, Ulrich Windl wrote:
> I'm surprised: Once sbd writes the fence command, it usually takes
> less than 3 seconds until the victim is dead. If you power off a
> server, the PDU still may have one or two seconds "power reserve", so
> the host may not be down immediately. Besides of that power-cycles are
> additional stress for the hardware...
> 
> So maybe you want to explain why and how much faster IPMI and PDU fencing are.

SBD is slow for me too. Since it doesn't have a way to confirm the kill,
it needs to wait for various timeouts, and these can be quite high. For
example, the IBM storage timeouts require this setup:

  Timeout (watchdog) : 130
  Timeout (msgwait)  : 270

On the same cluster IPMI fence executes in a second or two, but requires
network connectivity.
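
For reference, this is roughly how such timeouts end up on the SBD device
when it is initialized (the device path is just an example; -1 sets the
watchdog timeout, -4 the msgwait timeout):

  sbd -d /dev/disk/by-id/scsi-SHARED-LUN -1 130 -4 270 create
  sbd -d /dev/disk/by-id/scsi-SHARED-LUN dump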

-- 
Valentin


Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 07:31:02PM +0100, Valentin Vidic wrote:
> Right, but I'm not sure how this would help in the above situation
> unless the DRBD can undo the local write that did not succeed on the
> peer?

Ah, it seems the activity log handles the undo by storing the
location of these dirty blocks (the ones that were not replicated properly):

  https://docs.linbit.com/docs/users-guide-8.4/#s-activity-log

So on resync these blocks would get copied from the new primary.

-- 
Valentin


Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 02:01:07PM -0400, Digimer wrote:
> On 2019-03-20 2:00 p.m., Valentin Vidic wrote:
> > On Wed, Mar 20, 2019 at 01:47:56PM -0400, Digimer wrote:
> >> Not when DRBD is configured correctly. You sent 'fencing
> >> resource-and-stonith;' and set the appropriate fence handler. This tells
> >> DRBD to not proceed with a write while a node is in an unknown state
> >> (which happens when the node stops responding and is cleared on
> >> successful fence).
> > 
> > The situation I had in mind is this: node1 is Primary and sends a write to
> > the local disk and over the DRBD link. The network fails, so the write is
> > successful only on the local disk. node1 is then fenced and node2 takes
> > over, but the disks have now diverged.
> 
> This is handled by using Protocol C in DRBD. That tells DRBD to not
> consider a write complete until it has hit persistent storage on both nodes.

Right, but I'm not sure how this would help in the above situation
unless the DRBD can undo the local write that did not succeed on the
peer?

-- 
Valentin


Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 01:47:56PM -0400, Digimer wrote:
> Not when DRBD is configured correctly. You sent 'fencing
> resource-and-stonith;' and set the appropriate fence handler. This tells
> DRBD to not proceed with a write while a node is in an unknown state
> (which happens when the node stops responding and is cleared on
> successful fence).

The situation I had in mind is this: node1 is Primary and sends a write to
the local disk and over the DRBD link. The network fails, so the write is
successful only on the local disk. node1 is then fenced and node2 takes
over, but the disks have now diverged.
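
For reference, the pieces being discussed would look roughly like this in
a DRBD 8.4-style resource config (only a sketch; the section placement of
protocol and fencing differs between DRBD versions):

  resource r0 {
    net {
      protocol C;
    }
    disk {
      fencing resource-and-stonith;
    }
    handlers {
      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
  }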

-- 
Valentin


Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 01:44:06PM -0400, Digimer wrote:
> GFS2 notifies the peers of disk changes, and DRBD handles actually
> copying the changes to the peer.
> 
> Think of DRBD, in this context, as being mdadm RAID, like how writing to
> /dev/md0 is handled behind the scenes to write to both /dev/sda3 +
> /dev/sdb3. DRBD is like the same, any writes to /dev/drbd0 is written to
> both node1:/dev/sda3 + node2:/dev/sda3.
> 
> So DRBD handles replication, and GFS2 handles coordination.

Yes, I was thinking more of GFS2 in a shared storage setup: how much
overhead is there if the cluster nodes all write to different files,
like VM images?

-- 
Valentin


Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 01:34:52PM -0400, Digimer wrote:
> Depending on your fail-over tolerances, I might add NFS to the mix and
> have the NFS server run on one node or the other, exporting your ext4 FS
> that sits on DRBD in single-primary mode.
> 
> The failover (if the NFS host died) would look like this;
> 
> 1. Lost node is fenced.
> 2. DRBD is promoted from Secondary to Primary
> 3. ext4 FS is mounted.
> 4. Virtual IP (used for NFS) is brought up.
> 5. NFS starts
> 
> Startup and graceful migration would be the same, minus the fence.

Would it be possible for DRBD to go into SplitBrain if the lost node
manages to write something to local DRBD disk before it gets fenced?

-- 
Valentin


Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 12:37:21PM -0400, Digimer wrote:
>   Cluster filesystems are amazing if you need them, and to be avoided if
> at all possible. The overhead from the cluster locking hurts performance
> quite a lot, and adds a non-trivial layer of complexity.
> 
>   I say this as someone who has used dual-primary DRBD with GFS2 for
> many years.

If the GFS2 holds qcow2 images, does node1 need to synchronize writes
to vm1.qcow2 with node2 writing to vm2.qcow2?

-- 
Valentin


Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 09:36:58AM -0600, JCA wrote:
>  # pcs -f fs_cfg resource create TestFS Filesystem device="/dev/drbd1"
> directory="/tmp/Testing"
> fstype="ext4"

ext4 can only be mounted on one node at a time. If you need to access
files on both nodes at the same time, then a cluster filesystem (GFS2,
OCFS2) should be used.
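
For comparison, a rough sketch of the cluster-filesystem variant with pcs
(reusing the device and mount point from above; GFS2 additionally needs
dlm and, in this case, dual-primary DRBD, which is not shown here):

  pcs -f fs_cfg resource create TestFS Filesystem device="/dev/drbd1" \
    directory="/tmp/Testing" fstype="gfs2" clone interleave=true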

-- 
Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 10:23:17PM +, Eric Robinson wrote:
> I'm looking through the docs but I don't see how to set the on-fail value for 
> a resource. 

It is not set on the resource itself but on each of the actions (monitor, 
start, stop). 
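
For example, with pcs it would look something like this (the resource name
and values are only an illustration; which on-fail value makes sense
depends on the setup):

  pcs resource update p_mysql_001 op monitor interval=30s on-fail=ignore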

-- 
Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote:
> I just noticed that. I also noticed that the lsb init script has a
> hard-coded stop timeout of 30 seconds. So if the init script waits
> longer than the cluster resource timeout of 15s, that would cause the

Yes, you should use higher timeouts in pacemaker (45s for example).

> resource to fail. However, I don't want cluster failover to be
> triggered by the failure of one of the MySQL resources. I only want
> cluster failover to occur if the filesystem or drbd resources fail, or
> if the cluster messaging layer detects a complete node failure. Is
> there a way to tell PaceMaker not to trigger cluster failover if any
> of the p_mysql resources fail?  

You can try playing with the on-fail option but I'm not sure how
reliably this whole setup will work without some form of fencing/stonith.

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html

-- 
Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 09:03:43PM +, Eric Robinson wrote:
> Here are the relevant corosync logs.
> 
> It appears that the stop action for resource p_mysql_002 failed, and
> that caused a cascading series of service changes. However, I don't
> understand why, since no other resources are dependent on p_mysql_002.

The stop failed because of a timeout (15s), so you can try to update
that value:

  Result of stop operation for p_mysql_002 on 001db01a: Timed Out | call=1094 
key=p_mysql_002_stop_0 timeout=15000ms

After the stop failed, the cluster should have fenced that node, but you
don't have fencing configured, so it tries to move mysql_002 and all the
other resources related to it (vip, fs, drbd) to the other node.
Since the other mysql resources depend on the same (vip, fs, drbd), they
need to be stopped first.
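
For example, raising the stop timeout mentioned above could look roughly
like this (the value is just an example):

  pcs resource update p_mysql_002 op stop interval=0 timeout=45s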

-- 
Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 08:50:57PM +, Eric Robinson wrote:
> Which logs? You mean /var/log/cluster/corosync.log?

On the DC node pacemaker will be logging the actions it is trying
to run (start or stop some resources).

> But even if the stop action is resulting in an error, why would the
> cluster also try to stop the other services which are not dependent?

When a resource has failed, pacemaker might still try to run stop for
that resource. If the lsb script is not correct, that might also stop
other mysql resources. But this should all be reported in the pacemaker
log.
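
For example, something along these lines (the log location varies between
distributions):

  crm_mon -1 | grep 'Current DC'
  grep -E 'pengine|crmd' /var/log/cluster/corosync.log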

-- 
Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 08:34:21PM +, Eric Robinson wrote:
> Why is it that when one of the resources that start with p_mysql_*
> goes into a FAILED state, all the other MySQL services also stop?

Perhaps stop is not working correctly for these lsb services, so for
example stopping lsb:mysql_004 also stops the other lsb:mysql_nnn.

You would need to send the logs from the event to confirm this.
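
One way to check an init script by hand, roughly following the LSB exit
code conventions (the script name is just an example):

  /etc/init.d/mysql_004 status; echo $?   # expect 0 while running
  /etc/init.d/mysql_004 stop; echo $?     # expect 0, and only mysql_004 should stop
  /etc/init.d/mysql_004 status; echo $?   # expect 3 once stopped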

-- 
Valentin


Re: [ClusterLabs] Announcing hawk-apiserver, now in ClusterLabs

2019-02-12 Thread Valentin Vidic
On Tue, Feb 12, 2019 at 08:00:38PM +0100, Kristoffer Grönlund wrote:
> One final note: hawk-apiserver uses a project called go-pacemaker
> located at https://github.com/krig/go-pacemaker. I intend to transfer
> this to ClusterLabs as well. go-pacemaker is still somewhat rough around
> the edges, and our plan is to work on the C API of pacemaker to make
> using and exposing it via Go easier, as well as moving functionality
> from crm_mon into the C API so that status information can be made
> available in a more convenient format via the API as well.

So no Lisp here?  Just kidding, great work and looking forward to
trying it out :)

-- 
Valentin


Re: [ClusterLabs] Trying to Understanding crm-fence-peer.sh

2019-01-16 Thread Valentin Vidic
On Wed, Jan 16, 2019 at 04:20:03PM +0100, Valentin Vidic wrote:
> I think drbd always calls crm-fence-peer.sh when it becomes disconnected
> primary.  In this case storage1 has closed the DRBD connection and
> storage2 has become a disconnected primary.
> 
> Maybe the problem is the order that the services are stopped during
> reboot. It would seem that drbd is shutdown before pacemaker. You
> can try to run manually:
> 
>   pacemaker stop
>   corosync stop
>   drbd stop
> 
> and see what happens in this case.

Some more info here:

https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_drbd_fencing.html

So storage2 does not know why the other end disappeared and tries to use
pacemaker to prevent storage1 from ever becoming a primary.  Only when
storage1 comes back online and gets in sync is it allowed to start again
as a pacemaker resource, via a second script:

  after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
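
The fencing side works by adding a temporary location constraint to the
CIB, so if storage1 stays blocked, looking for a constraint whose ID
starts with drbd-fence-by-handler should show it, e.g.:

  crm configure show | grep drbd-fence-by-handler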

-- 
Valentin


Re: [ClusterLabs] Trying to Understanding crm-fence-peer.sh

2019-01-16 Thread Valentin Vidic
On Wed, Jan 16, 2019 at 09:03:21AM -0600, Bryan K. Walton wrote:
> The exit code 4 would seem to suggest that storage1 should be fenced.
> But the switch ports connected to storage1 are still enabled.
> 
> Am I misreading the logs here?  This is a clean reboot, maybe fencing
> isn't supposed to happen in this situation?  But the logs seem to
> suggest otherwise.

I think drbd always calls crm-fence-peer.sh when it becomes a disconnected
primary.  In this case storage1 has closed the DRBD connection and
storage2 has become a disconnected primary.

Maybe the problem is the order in which the services are stopped during
reboot. It would seem that drbd is shut down before pacemaker. You
can try to run these manually:

  pacemaker stop
  corosync stop
  drbd stop

and see what happens in this case.
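
On a systemd-based setup that would correspond roughly to the following
(assuming the stock unit names, checking the DRBD connection state in
between):

  drbdadm cstate all
  systemctl stop pacemaker
  systemctl stop corosync
  systemctl stop drbd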

-- 
Valentin


Re: [ClusterLabs] Unexpected resource restart

2019-01-16 Thread Valentin Vidic
On Wed, Jan 16, 2019 at 12:41:11PM +0100, Valentin Vidic wrote:
> This is what pacemaker says about the resource restarts:
> 
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  dlm:1 
>   ( node2 )  
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  
> lockd:1 ( node2 )  
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Restart
> gfs2-lvm:0  ( node1 )   due to required storage-clone running
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Restart
> gfs2-fs:0   ( node1 )   due to required gfs2-lvm:0 start
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  
> gfs2-lvm:1  ( node2 )  
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  
> gfs2-fs:1   ( node2 )  
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Restart
> ocfs2-lvm:0 ( node1 )   due to required storage-clone running
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Restart
> ocfs2-fs:0  ( node1 )   due to required ocfs2-lvm:0 start
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  
> ocfs2-lvm:1 ( node2 )  
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  
> ocfs2-fs:1  ( node2 )  

It seems interleave was required on the gfs2 and ocfs2 clones:

  interleave (default: false)
  If this clone depends on another clone via an ordering constraint, is
  it allowed to start after the local instance of the other clone starts, rather
  than wait for all instances of the other clone to start?

Now it behaves as expected when node2 is set online:

Jan 16 12:35:33 node1 pacemaker-schedulerd[564]:  notice:  * Start  dlm:1   
( node2 )  
Jan 16 12:35:33 node1 pacemaker-schedulerd[564]:  notice:  * Start  lockd:1 
( node2 )  
Jan 16 12:35:33 node1 pacemaker-schedulerd[564]:  notice:  * Start  
gfs2-lvm:1  ( node2 )  
Jan 16 12:35:33 node1 pacemaker-schedulerd[564]:  notice:  * Start  
gfs2-fs:1   ( node2 )  
Jan 16 12:35:33 node1 pacemaker-schedulerd[564]:  notice:  * Start  
ocfs2-lvm:1 ( node2 )  
Jan 16 12:35:33 node1 pacemaker-schedulerd[564]:  notice:  * Start  
ocfs2-fs:1  ( node2 )  

 Clone: gfs2-clone
  Meta Attrs: interleave=true target-role=Started
  Group: gfs2
   Resource: gfs2-lvm (class=ocf provider=heartbeat type=LVM-activate)
Attributes: activation_mode=shared vg_access_mode=lvmlockd vgname=vgshared 
lvname=gfs2
   Resource: gfs2-fs (class=ocf provider=heartbeat type=Filesystem)
Attributes: directory=/srv/gfs2 fstype=gfs2 device=/dev/vgshared/gfs2

-- 
Valentin


Re: [ClusterLabs] Unexpected resource restart

2019-01-16 Thread Valentin Vidic
On Wed, Jan 16, 2019 at 12:16:04PM +, Andrew Price wrote:
> The only thing that stands out to me with this config is the lack of
> ordering constraint between dlm and lvmlockd. Not sure if that's the issue
> though.

They are both in the storage group, so the order should be dlm then lockd?

   Clone: storage-clone
Meta Attrs: interleave=true target-role=Started
Group: storage
 Resource: dlm (class=ocf provider=pacemaker type=controld)
 Resource: lockd (class=ocf provider=heartbeat type=lvmlockd)

-- 
Valentin


Re: [ClusterLabs] Unexpected resource restart

2019-01-16 Thread Valentin Vidic
On Wed, Jan 16, 2019 at 12:28:59PM +0100, Valentin Vidic wrote:
> When node2 is set to standby, resources stop running there.  However, when
> node2 is brought back online, it causes the resources on node1 to stop
> and then start again, which is a bit unexpected?
> 
> Maybe the dependency between the common storage group and the upper
> gfs2/ocfs2 groups could be written in some other way to prevent this
> resource restart?

This is what pacemaker says about the resource restarts:

Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  dlm:1   
( node2 )  
Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  lockd:1 
( node2 )  
Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Restart
gfs2-lvm:0  ( node1 )   due to required storage-clone running
Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Restart
gfs2-fs:0   ( node1 )   due to required gfs2-lvm:0 start
Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  
gfs2-lvm:1  ( node2 )  
Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  
gfs2-fs:1   ( node2 )  
Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Restart
ocfs2-lvm:0 ( node1 )   due to required storage-clone running
Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Restart
ocfs2-fs:0  ( node1 )   due to required ocfs2-lvm:0 start
Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  
ocfs2-lvm:1 ( node2 )  
Jan 16 11:19:08 node1 pacemaker-schedulerd[713]:  notice:  * Start  
ocfs2-fs:1  ( node2 )  

-- 
Valentin


[ClusterLabs] Unexpected resource restart

2019-01-16 Thread Valentin Vidic
Hi all,

I'm testing the following configuration with two nodes:

 Clone: storage-clone
  Meta Attrs: interleave=true target-role=Started
  Group: storage
   Resource: dlm (class=ocf provider=pacemaker type=controld)
   Resource: lockd (class=ocf provider=heartbeat type=lvmlockd)

 Clone: gfs2-clone
  Group: gfs2
   Resource: gfs2-lvm (class=ocf provider=heartbeat type=LVM-activate)
Attributes: activation_mode=shared vg_access_mode=lvmlockd vgname=vgshared 
lvname=gfs2
   Resource: gfs2-fs (class=ocf provider=heartbeat type=Filesystem)
Attributes: directory=/srv/gfs2 fstype=gfs2 device=/dev/vgshared/gfs2

 Clone: ocfs2-clone
  Group: ocfs2
   Resource: ocfs2-lvm (class=ocf provider=heartbeat type=LVM-activate)
Attributes: activation_mode=shared vg_access_mode=lvmlockd vgname=vgshared 
lvname=ocfs2
   Resource: ocfs2-fs (class=ocf provider=heartbeat type=Filesystem)
Attributes: directory=/srv/ocfs2 fstype=ocfs2 device=/dev/vgshared/ocfs2

Ordering Constraints:
  storage-clone then gfs2-clone (kind:Mandatory) (id:gfs2_after_storage)
  storage-clone then ocfs2-clone (kind:Mandatory) (id:ocfs2_after_storage)
Colocation Constraints:
  gfs2-clone with storage-clone (score:INFINITY) (id:gfs2_with_storage)
  ocfs2-clone with storage-clone (score:INFINITY) (id:ocfs2_with_storage)

When node2 is set to standby, resources stop running there.  However, when
node2 is brought back online, it causes the resources on node1 to stop
and then start again, which is a bit unexpected?

Maybe the dependency between the common storage group and the upper
gfs2/ocfs2 groups could be written in some other way to prevent this
resource restart?

-- 
Valentin


Re: [ClusterLabs] Status of Pacemaker 2 support in SBD?

2019-01-11 Thread Valentin Vidic
On Fri, Jan 11, 2019 at 12:42:02PM +0100, wf...@niif.hu wrote:
> I opened https://github.com/ClusterLabs/sbd/pull/62 with our current
> patches, but I'm just a middle man here.  Valentin, do you agree to
> upstream these two remaining patches of yours?

Sure thing, merge anything you can...

-- 
Valentin


Re: [ClusterLabs] Upgrading from CentOS 6 to CentOS 7

2019-01-03 Thread Valentin Vidic
On Thu, Jan 03, 2019 at 04:56:26PM -0600, Ken Gaillot wrote:
> Right -- not only that, but corosync 1 (CentOS 6) and corosync 2
> (CentOS 7) are not compatible for running in the same cluster.

I suppose it is the same situation for upgrading from corosync 2
to corosync 3?

-- 
Valentin


Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops

2018-11-13 Thread Valentin Vidic
On Tue, Nov 13, 2018 at 11:01:46AM -0600, Ken Gaillot wrote:
> Clone instances have a default stickiness of 1 (instead of the usual 0)
> so that they aren't needlessly shuffled around nodes every transition.
> You can temporarily set an explicit stickiness of 0 to let them
> rebalance, then unset it to go back to the default.

Thanks, this works as expected now:

  clone cip-clone cip \
meta clone-max=2 clone-node-max=2 globally-unique=true interleave=true \
 resource-stickiness=0 target-role=Started

The clone instance moves when a node is down, but also returns when the
node is back online.

Do you perhaps know if CLUSTERIP has any special network requirements to
work properly?

-- 
Valentin


Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops

2018-11-13 Thread Valentin Vidic
On Tue, Nov 13, 2018 at 05:04:19PM +0100, Valentin Vidic wrote:
> Also it seems to require multicast, so better check for that too :)

And while the CLUSTERIP resource seems to work for me in a test
cluster, the following clone definition:

  clone cip-clone cip \
meta clone-max=2 clone-node-max=2 globally-unique=true interleave=true 
target-role=Started

allows for both clone instances to end up on the same node:

 Clone Set: cip-clone [cip] (unique)
 cip:0  (ocf::heartbeat:IPaddr2):   Started sid2
 cip:1  (ocf::heartbeat:IPaddr2):   Started sid2

Is there a way to spread the resources other than setting
clone-node-max=1 for a while?

-- 
Valentin


Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops

2018-11-13 Thread Valentin Vidic
On Tue, Nov 13, 2018 at 04:06:34PM +0100, Valentin Vidic wrote:
> Could be some kind of ARP inspection going on in the networking equipment,
> so check switch logs if you have access to that.

Also it seems to require multicast, so better check for that too :)

-- 
Valentin


Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops

2018-11-13 Thread Valentin Vidic
On Tue, Nov 13, 2018 at 09:06:56AM -0500, Daniel Ragle wrote:
> Thanks, finally getting back to this. Putting a tshark on both nodes and
> then restarting the VIP-clone resource shows the pings coming through for 12
> seconds, always on node2, then stop. I.E., before/after those 12 seconds
> nothing on either node from the server initiating the pings.

Could be some kind of ARP inspection going on in the networking equipment,
so check switch logs if you have access to that.

-- 
Valentin


Re: [ClusterLabs] [ClusterLabs Developers] resource-agents v4.2.0 rc1

2018-10-19 Thread Valentin Vidic
On Fri, Oct 19, 2018 at 11:09:34AM +0200, Kristoffer Grönlund wrote:
> I wonder if perhaps there was a configuration change as well, since the
> return code seems to be configuration related. Maybe something changed
> in the build scripts that moved something around? Wild guess, but...

Seems to be a problem with the agent script in the end:

https://github.com/ClusterLabs/resource-agents/pull/1254

-- 
Valentin


Re: [ClusterLabs] resource-agents v4.2.0 rc1

2018-10-18 Thread Valentin Vidic
On Wed, Oct 17, 2018 at 12:03:18PM +0200, Oyvind Albrigtsen wrote:
>  - apache: retry PID check.

I noticed that the ocft test started failing for apache in this
version. Not sure if the test is broken or the agent. Can you
check if the test still works for you? Restoring the previous
version of the agent fixes the problem for me.

# ocft test -v apache
Initializing 'apache' ...
Done.

Starting 'apache' case 0 'check base env':
ERROR: './apache monitor' failed, the return code is 2.
Starting 'apache' case 1 'check base env: set non-existing 
OCF_RESKEY_statusurl':
ERROR: './apache monitor' failed, the return code is 2.
Starting 'apache' case 2 'check base env: set non-existing 
OCF_RESKEY_configfile':
ERROR: './apache monitor' failed, the return code is 2.
Starting 'apache' case 3 'normal start':
ERROR: './apache monitor' failed, the return code is 2.
Starting 'apache' case 4 'normal stop':
ERROR: './apache monitor' failed, the return code is 2.
Starting 'apache' case 5 'double start':
ERROR: './apache monitor' failed, the return code is 2.
Starting 'apache' case 6 'double stop':
ERROR: './apache monitor' failed, the return code is 2.
Starting 'apache' case 7 'running monitor':
ERROR: './apache monitor' failed, the return code is 2.
Starting 'apache' case 8 'not running monitor':
ERROR: './apache monitor' failed, the return code is 2.
Starting 'apache' case 9 'unimplemented command':
ERROR: './apache monitor' failed, the return code is 2.

-- 
Valentin


Re: [ClusterLabs] LIO iSCSI target fails to start

2018-10-11 Thread Valentin Vidic
On Wed, Oct 10, 2018 at 02:36:21PM +0200, Stefan K wrote:
> I think my config is correct, but it sill fails with "This Target
> already exists in configFS" but "targetcli ls" shows nothing.

It seems to find something in /sys/kernel/config/target.  Maybe it
was set up outside of pacemaker somehow?
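
To see what is left behind, and to clear it outside of pacemaker if it
turns out to be stale (destructive, so only on a node where the target is
not supposed to be running):

  ls /sys/kernel/config/target/iscsi/
  targetcli clearconfig confirm=True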

-- 
Valentin


Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops

2018-10-11 Thread Valentin Vidic
On Thu, Oct 11, 2018 at 01:25:52PM -0400, Daniel Ragle wrote:
> For the 12 second window it *does* work in, it appears as though it works
> only on one of the two servers (and always the same one). My twelve seconds
> of pings runs continuously then stops; while attempts to hit the Web server
> works hit or miss depending on my source port (I'm using
> sourceip-sourceport). I.E., as if anything that would be handled by the
> other server isn't making it through. But after the 12 seconds neither
> server responds to the requests against the VIP (but they do respond fine to
> their own static IPs at all times).

Could be that the switch in front of the servers does not like to see
the same MAC on two ports or something like that.

> During the 12 seconds that it works I get these in the logs of the server
> that *is* responding:
> 
> Oct 11 12:17:43 node2 kernel: ipt_CLUSTERIP: unknown protocol 1
> Oct 11 12:17:44 node2 kernel: ipt_CLUSTERIP: unknown protocol 1
> Oct 11 12:17:45 node2 kernel: ipt_CLUSTERIP: unknown protocol 1

Protocol 1 once per second should be ICMP PING so this is just CLUSTERIP
complaining that it can't calculate sourceip-sourceport for those packets
(ICMP has no source port).

So maybe try recording the traffic using tcpdump on both servers and
see if any requests are coming in at all from the network equipment.
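
For example, something like this on both nodes (the interface name and VIP
are placeholders):

  tcpdump -ni eth0 host 172.16.0.10 or arp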

-- 
Valentin


Re: [ClusterLabs] [ClusterLabs Developers] fence-agents v4.3.0

2018-10-09 Thread Valentin Vidic
On Tue, Oct 09, 2018 at 12:07:38PM +0200, Oyvind Albrigtsen wrote:
> I've created a PR for the library detection and try/except imports:
> https://github.com/ClusterLabs/fence-agents/pull/242

Thanks, I will give it a try right away...

-- 
Valentin


Re: [ClusterLabs] [ClusterLabs Developers] fence-agents v4.3.0

2018-10-09 Thread Valentin Vidic
On Tue, Oct 09, 2018 at 10:55:08AM +0200, Oyvind Albrigtsen wrote:
> It seems like the if-line should be updated to check for those 2
> libraries (from the imports in the agent).

Yes, that might work too.

Also, would it be possible to make the imports in the openstack agent
conditional so the metadata works even when the python modules
are not installed?

Something like this in aliyun:

try:
from aliyunsdkcore import client

from aliyunsdkecs.request.v20140526.DescribeInstancesRequest import 
DescribeInstancesRequest
from aliyunsdkecs.request.v20140526.StartInstanceRequest import 
StartInstanceRequest
from aliyunsdkecs.request.v20140526.StopInstanceRequest import 
StopInstanceRequest
from aliyunsdkecs.request.v20140526.RebootInstanceRequest import 
RebootInstanceRequest
except ImportError:
pass

-- 
Valentin


Re: [ClusterLabs] [ClusterLabs Developers] fence-agents v4.3.0

2018-10-09 Thread Valentin Vidic
On Tue, Oct 02, 2018 at 03:13:51PM +0200, Oyvind Albrigtsen wrote:
> ClusterLabs is happy to announce fence-agents v4.3.0.
> 
> The source code is available at:
> https://github.com/ClusterLabs/fence-agents/releases/tag/v4.3.0
> 
> The most significant enhancements in this release are:
> - new fence agents:
>  - fence_aliyun
>  - fence_openstack

Could not get openstack to build without a patch.  Can you check
if this works for you:

--- a/configure.ac
+++ b/configure.ac
@@ -246,8 +246,7 @@
 fi
 
 if echo "$AGENTS_LIST" | grep -q openstack; then
-AC_PYTHON_MODULE(novaclient)
-AC_PYTHON_MODULE(keystoneclient)
+AC_PYTHON_MODULE(openstackclient)
 if test "x${HAVE_PYMOD_OPENSTACKCLIENT}" != xyes; then
 AGENTS_LIST=$(echo "$AGENTS_LIST" | sed -E 
"s#openstack/fence_openstack.py( |$)##")
 AC_MSG_WARN("Not building fence_openstack")

-- 
Valentin


Re: [ClusterLabs] Position of pacemaker in today's HA world

2018-10-05 Thread Valentin Vidic
On Fri, Oct 05, 2018 at 11:34:10AM -0500, Ken Gaillot wrote:
> The next big challenge is that high availability is becoming a subset
> of the "orchestration" space in terms of how we fit into IT
> departments. Systemd and Kubernetes are the clear leaders in service
> orchestration today and likely will be for a long while. Other forms of
> orchestration such as Ansible are also highly relevant. Tighter
> integration with these would go a long way toward establishing
> longevity.

Kubernetes seems to be mostly about stateless services and it kind of
expects you to have some external highly available data store.  You can
either buy that data store from your cloud provider or try to make it
yourself, for example using Pacemaker and Galera. Maybe some agents for
registering Pacemaker resources in the Kubernetes etcd or consul would
be useful to connect the two worlds.

On the other side, the big players seem to have settled on using
distributed consensus protocols (Paxos, Raft, Zab) for building
replicated state machines across multiple data centers (or even
continents). It would be an interesting experiment if we could easily
hook up Pacemaker with Zookeeper - selecting a DC node would require
just creating a znode file. The master node can then store the cluster
configuration as another znode and also ask other nodes to do some work
through a job queue. On a global scale this would be quite slow (5-10
changes per second) and node fencing is not available but for some types
of resources it might be useful.

Some more reading material on the topic for the weekend :)

https://landing.google.com/sre/book/chapters/managing-critical-state.html
https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

-- 
Valentin


Re: [ClusterLabs] About fencing stonith

2018-09-26 Thread Valentin Vidic
On Thu, Sep 06, 2018 at 04:47:32PM -0400, Digimer wrote:
> It depends on the hardware you have available. In your case, RPi has no
> IPMI or similar feature, so you'll need something external, like a
> switched PDU. I like the APC AP7900 (or your countries variant), which
> you can often get used for a decent price if this isn't a production system.
> 
> http://www.apc.com/shop/us/en/products/Rack-PDU-Switched-1U-15A-100-120V-85-15/P-AP7900

RPi should have a reset pin, so it might be possible to cross-connect a
GPIO pin from one RPi to the reset pin on another and get cheap reset
functionality.

-- 
Valentin


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 09:31:13AM -0400, Patrick Whitney wrote:
> But, when I invoke the "human" stonith power device (i.e. I turn the node
> off), the other node collapses...
> 
> In the logs I supplied, I basically do this:
> 
> 1. stonith fence (With fence scsi)

After fence_scsi finishes the node should not show any signs of life.
If it continues to work on the network after this point it can cause
trouble.

> 2. verify UI shows fenced node as stopped
> 3. power off fenced node

Not sure if you used the poweroff command to shut down the node or turned
it off some other way?

If you don't have any other fence plugin you can use, try testing with
meatware. Stonith will wait until you manually confirm with meatclient
that the node is down.

-- 
Valentin


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 04:14:08PM +0300, Vladislav Bogdanov wrote:
> And that is not an easy task sometimes, because main part of dlm runs in
> kernel.
> In some circumstances the only option is to forcibly reset the node.

Exactly, killing the power on the node will stop the DLM code running in
the kernel too.

-- 
Valentin


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 09:13:08AM -0400, Patrick Whitney wrote:
> So when the cluster suggests that DLM is shutdown on coro-test-1:
> Clone Set: dlm-clone [dlm]
>  Started: [ coro-test-2 ]
>  Stopped: [ coro-test-1 ]
> 
> ... DLM isn't actually stopped on 1?

If you can connect to the node and see dlm services running then
it is not stopped:

20101 dlm_controld
20245 dlm_scand
20246 dlm_recv
20247 dlm_send
20248 dlm_recoverd

But if you kill the power on the node then it will be gone for sure :)

-- 
Valentin


Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 09:02:06AM -0400, Patrick Whitney wrote:
> What I'm having trouble understanding is why dlm flattens the remaining
> "running" node when the already fenced node is shutdown...  I'm having
> trouble understanding how power fencing would cause dlm to behave any
> differently than just shutting down the fenced node.

fence_scsi just kills the storage on the node, but dlm continues to run,
causing problems for the rest of the cluster nodes.  So it seems some
other fence agent should be used that would kill dlm too.

-- 
Valentin


Re: [ClusterLabs] Antw: Re: pcsd processes using 100% CPU

2018-07-23 Thread Valentin Vidic
On Thu, May 24, 2018 at 12:16:16AM -0600, Casey & Gina wrote:
> Tried that, it doesn't seem to do anything but prefix the lines with the pid:
> 
> [pid 24923] sched_yield()   = 0
> [pid 24923] sched_yield()   = 0
> [pid 24923] sched_yield()   = 0

We managed to track this down to a fork bug in some Ruby
versions:

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=876377

It is now fixed in Debian stretch/stable but other distros
might still have this problem.

-- 
Valentin


Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Valentin Vidic
On Wed, Jul 11, 2018 at 04:31:31PM -0600, Casey & Gina wrote:
> Forgive me for interjecting, but how did you upgrade on Ubuntu?  I'm
> frustrated with limitations in 1.1.14 (particularly in PCS so not sure
> if it's relevant), and Ubuntu is ignoring my bug reports, so it would
> be great to upgrade if possible.  I'm using Ubuntu 16.04.

pcs is a single package written in Python and Ruby, so it should be
possible to try a newer version and see if it helps.

-- 
Valentin


Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Valentin Vidic
On Wed, Jul 11, 2018 at 08:01:46PM +0200, Salvatore D'angelo wrote:
> Yes, but doing what you suggested the system find that sysV is
> installed and try to leverage on update-rc.d scripts and the failure
> occurs:
> 
> root@pg1:~# systemctl enable corosync
> corosync.service is not a native service, redirecting to systemd-sysv-install
> Executing /lib/systemd/systemd-sysv-install enable corosync
> update-rc.d: error: corosync Default-Start contains no runlevels, aborting.
> 
> the only fix I found was to manipulate manually the header of
> /etc/init.d/corosync adding the rows mentioned below.
> But this is not a clean approach to solve the issue.
> 
> What pacemaker suggest for newer distributions?

You can try using init scripts from the Debian/Ubuntu packages for
corosync and pacemaker as they have the runlevel info included.

Another option is to get the systemd service files working and then
remove the init scripts.
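
A sketch of the second option, assuming the packages ship native unit
files:

  systemctl enable corosync.service pacemaker.service
  update-rc.d -f corosync remove
  update-rc.d -f pacemaker remove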

-- 
Valentin


Re: [ClusterLabs] chap lio-t / iscsitarget disabled - why?

2018-04-03 Thread Valentin Vidic
On Tue, Apr 03, 2018 at 04:48:00PM +0200, Stefan Friedel wrote:
> we've a running drbd - iscsi cluster (two nodes Debian stretch, pacemaker /
> corosync, res group w/ ip + iscsitarget/lio-t + iscsiluns + lvm etc. on top of
> drbd etc.). Everything is running fine - but we didn't manage to get CHAP to
> work. targetcli / lio-t always switches the authentication off after a 
> migration
> or restart.
> 
> I found the following lines in the iSCSITarget resource file (Debian stretch
> /usr/lib/ocf/resource.d/heartbeat/iSCSITarget, also in
> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/iSCSITarget.in):
> 
>[...]
># TODO: add CHAP authentication support when it gets added back into LIO
>ocf_run targetcli /iscsi/${OCF_RESKEY_iqn}/tpg1/ set attribute 
> authentication=0 || exit $OCF_ERR_GENERIC
>[...]

Yes, another comment in that file suggests that CHAP support was not
available at that time (2009) in lio and/or lio-t:

lio|lio-t)
# TODO: Remove incoming_username and incoming_password
# from this check when LIO 3.0 gets CHAP authentication
unsupported_params="tid incoming_username incoming_password"
;;

If you get it working with the current version of targetcli-fb, you can
create a pull request in the ClusterLabs repo :)

-- 
Valentin


Re: [ClusterLabs] False negative from kamailio resource agent

2018-03-23 Thread Valentin Vidic
On Thu, Mar 22, 2018 at 03:36:55PM -0400, Alberto Mijares wrote:
> Straight to the question: how can I manually run a resource agent
> script (kamailio) simulating the pacemaker's environment without
> actually having pacemaker running?

You should be able to run it with something like:

# OCF_ROOT=/usr/lib/ocf OCF_RESKEY_conffile=/etc/kamailio/kamailio.cfg 
/usr/lib/ocf/resource.d/heartbeat/kamailio monitor
INFO: No PID file found and our kamailio instance is not active

> We have this cluster in production and from time to time kamailio
> reported a failure when the reality is that kamailio was still
> running. The failover produces a small downtime unnecessarily, so we
> decided to stop it until we find a solution for that.
> 
> I saw the check functions in the script. They all run OK out of the
> pacemaker's environment, so I need to replicate it by my own.

Do you have anything in the logs that might say why the monitor action
failed for the resource?  Maybe it was overloaded for a moment and did
not respond to sipsak tests.

-- 
Valentin


Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Valentin Vidic
On Mon, Mar 12, 2018 at 04:31:46PM +0100, Klaus Wenninger wrote:
> Nope. Whenever the cluster is completely down...
> Otherwise nodes would come up - if not seeing each other -
> happily with both starting all services because they don't
> know what already had been running on the other node.
> Technically it wouldn't even be possible to remember that
> they've seen once as Corosync doesn't have "non-volatile-storage"
> apart from the config-file.

Interesting, I have the following config in a test cluster:

nodelist {
node {
ring0_addr: sid1
nodeid: 1
}

node {
ring0_addr: sid2
nodeid: 2
}
}

quorum {

# Enable and configure quorum subsystem (default: off)
# see also corosync.conf.5 and votequorum.5
provider: corosync_votequorum
expected_votes: 1
two_node: 1
}

And the behaviour when both nodes are down seems to be:

1. One node up
2. Fence other node
3. Start services

Mar 12 18:15:01 sid1 crmd[555]:   notice: Connecting to cluster infrastructure: 
corosync
Mar 12 18:15:01 sid1 crmd[555]:   notice: Quorum acquired
Mar 12 18:15:01 sid1 crmd[555]:   notice: Node sid1 state is now member
Mar 12 18:15:01 sid1 crmd[555]:   notice: State transition S_STARTING -> 
S_PENDING
Mar 12 18:15:23 sid1 crmd[555]:  warning: Input I_DC_TIMEOUT received in state 
S_PENDING from crm_timer_popped
Mar 12 18:15:23 sid1 crmd[555]:   notice: State transition S_ELECTION -> 
S_INTEGRATION
Mar 12 18:15:23 sid1 crmd[555]:  warning: Input I_ELECTION_DC received in state 
S_INTEGRATION from do_election_check
Mar 12 18:15:23 sid1 crmd[555]:   notice: Result of probe operation for 
stonith-sbd on sid1: 7 (not running)
Mar 12 18:15:23 sid1 crmd[555]:   notice: Result of probe operation for dlm on 
sid1: 7 (not running)
Mar 12 18:15:23 sid1 crmd[555]:   notice: Result of probe operation for 
admin-ip on sid1: 7 (not running)
Mar 12 18:15:23 sid1 crmd[555]:   notice: Result of probe operation for 
clusterfs on sid1: 7 (not running)
Mar 12 18:15:57 sid1 stonith-ng[551]:   notice: Operation 'reboot' [1454] (call 
2 from crmd.555) for host 'sid2' with device 'stonith-sbd' returned: 0 (OK)
Mar 12 18:15:57 sid1 stonith-ng[551]:   notice: Operation reboot of sid2 by 
sid1 for crmd.555@sid1.ece4f9c5: OK
Mar 12 18:15:57 sid1 crmd[555]:   notice: Node sid2 state is now lost
Mar 12 18:15:58 sid1 crmd[555]:   notice: Result of start operation for dlm on 
sid1: 0 (ok)
Mar 12 18:15:58 sid1 crmd[555]:   notice: Result of start operation for 
admin-ip on sid1: 0 (ok)
Mar 12 18:15:58 sid1 crmd[555]:   notice: Result of start operation for 
stonith-sbd on sid1: 0 (ok)
Mar 12 18:15:58 sid1 crmd[555]:   notice: Result of start operation for 
clusterfs on sid1: 0 (ok)
Mar 12 18:15:58 sid1 crmd[555]:   notice: Transition 0 (Complete=18, Pending=0, 
Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-32.bz2): Complete
Mar 12 18:15:58 sid1 crmd[555]:   notice: State transition S_TRANSITION_ENGINE 
-> S_IDLE

-- 
Valentin


Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Valentin Vidic
On Mon, Mar 12, 2018 at 01:58:21PM +0100, Klaus Wenninger wrote:
> But isn't dlm directly interfering with corosync so
> that it would get the quorum state from there?
> As you have 2-node set probably on a 2-node-cluster
> this would - after both nodes down - wait for all
> nodes up first.

Isn't wait_for_all only used during cluster installation?

votequorum(5):

"When WFA is enabled, the cluster will be quorate for the first time
only after all nodes have been visible at least once at the same time."

-- 
Valentin


Re: [ClusterLabs] trouble with IPaddr2

2017-10-12 Thread Valentin Vidic
On Wed, Oct 11, 2017 at 02:36:24PM +0200, Valentin Vidic wrote:
> AFAICT, it found a better interface with that subnet and tried
> to use it instead of the one specified in the parameters :)
> 
> But maybe IPaddr2 should just skip interface auto-detection
> if an explicit interface was given in the parameters?

Oh, it seems you specified nic only for the monitor operation, so
it would fall back to auto-detection for the start and stop actions:

primitive HA_IP-Serv1 IPaddr2 \
params ip=172.16.101.70 cidr_netmask=16 \
op monitor interval=20 timeout=30 on-fail=restart nic=bond0 \
meta target-role=Started

So you probably wanted this instead:

primitive HA_IP-Serv1 IPaddr2 \
params ip=172.16.101.70 cidr_netmask=16 nic=bond0 \
op monitor interval=20 timeout=30 on-fail=restart \
meta target-role=Started

-- 
Valentin



Re: [ClusterLabs] trouble with IPaddr2

2017-10-11 Thread Valentin Vidic
On Wed, Oct 11, 2017 at 01:29:40PM +0200, Stefan Krueger wrote:
> ohh damn.. thanks a lot for this hint.. I delete all the IPs on enp4s0f0, and 
> than it works..
> but could you please explain why it now works? why he has a problem with this 
> IPs?

AFAICT, it found a better interface with that subnet and tried
to use it instead of the one specified in the parameters :)

But maybe IPaddr2 should just skip interface auto-detection
if an explicit interface was given in the parameters?

-- 
Valentin



Re: [ClusterLabs] trouble with IPaddr2

2017-10-11 Thread Valentin Vidic
On Wed, Oct 11, 2017 at 10:51:04AM +0200, Stefan Krueger wrote:
> primitive HA_IP-Serv1 IPaddr2 \
> params ip=172.16.101.70 cidr_netmask=16 \
> op monitor interval=20 timeout=30 on-fail=restart nic=bond0 \
> meta target-role=Started

There might be something wrong with the network setup because enp4s0f0
gets used instead of bond0:

> Oct 11 08:19:32 zfs-serv2 IPaddr2(HA_IP-Serv1)[27672]: INFO: Adding inet 
> address 172.16.101.70/16 with broadcast address 172.16.255.255 to device 
> enp4s0f0

Can you share more info on the network of zfs-serv2, for example: ip a?

-- 
Valentin



Re: [ClusterLabs] corosync service not automatically started

2017-10-10 Thread Valentin Vidic
On Tue, Oct 10, 2017 at 11:26:24AM +0200, Václav Mach wrote:
> # The primary network interface
> allow-hotplug eth0
> iface eth0 inet dhcp
> # This is an autoconfigured IPv6 interface
> iface eth0 inet6 auto

allow-hotplug or dhcp could be causing problems.  You can try
disabling corosync and pacemaker so they don't start on boot
and start them manually after a few minutes when the network
is stable.  If it works, then you have some kind of a timing
issue.  You can try using 'auto eth0' or a static IP address
to see if it helps...
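
For the manual-start test, something like this should do (assuming the
usual systemd units):

  systemctl disable corosync pacemaker
  # after boot, once the network is confirmed up:
  systemctl start corosync pacemaker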

-- 
Valentin



Re: [ClusterLabs] corosync service not automatically started

2017-10-10 Thread Valentin Vidic
On Tue, Oct 10, 2017 at 10:35:17AM +0200, Václav Mach wrote:
> Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]:   [QB] Denied
> connection, is not ready (709-1337-18)
> Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]:   [QB] Denied
> connection, is not ready (709-1337-18)
> Oct 10 10:27:07 r1nren.et.cesnet.cz corosync[709]:   [QB] Denied
> connection, is not ready (709-1337-18)
> Oct 10 10:27:08 r1nren.et.cesnet.cz corosync[709]:   [QB] Denied
> connection, is not ready (709-1337-18)
> Oct 10 10:27:09 r1nren.et.cesnet.cz corosync[709]:   [QB] Denied
> connection, is not ready (709-1337-18)

Could it be that the network or the firewall takes some time to start
on boot?

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] PostgreSQL Automatic Failover (PAF) v2.2.0

2017-10-05 Thread Valentin Vidic
On Thu, Oct 05, 2017 at 08:55:59PM +0200, Jehan-Guillaume de Rorthais wrote:
> It doesn't seems impossible, however I'm not sure of the complexity around
> this.
> 
> You would have to either hack PAF and detect failover/migration or create a 
> new
> RA that would always be part of the transition implying your PAF RA to define
> if it is moving elsewhere or not. 
> 
> It feels the complexity is quite high and would require some expert advices
> about Pacemaker internals to avoid wrong or unrelated behaviors or race
> conditions.
> 
> But, before going farther, you need to realize a failover will never be
> transparent. Especially one that would trigger randomly outside of your 
> control.

Yes, I was thinking more about manual failover, for example to upgrade
the postgresql master.  An RA for pgbouncer would wait for all active
queries to finish and queue all new queries.  Once there is nothing
running on the master anymore, another slave is activated and pgbouncer
would then resume queries there.
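
The manual steps on the pgbouncer admin console would look roughly
like this (just a sketch, the database name is made up):

  -- connect to the admin console, e.g. psql -p 6432 pgbouncer
  PAUSE mydb;    -- waits for active queries to finish, new ones get queued
  -- switch the PostgreSQL master with the cluster manager here
  RESUME mydb;   -- queued queries continue against the new master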

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] PostgreSQL Automatic Failover (PAF) v2.2.0

2017-10-05 Thread Valentin Vidic
On Tue, Sep 12, 2017 at 04:48:19PM +0200, Jehan-Guillaume de Rorthais wrote:
> PostgreSQL Automatic Failover (PAF) v2.2.0 has been released on September
> 12th 2017 under the PostgreSQL licence.
> 
> See: https://github.com/dalibo/PAF/releases/tag/v2.2.0
> 
> PAF is a PostgreSQL resource agent for Pacemaker. Its original aim is to
> keep it clear between the Pacemaker administration and the PostgreSQL one, to
> keep things simple, documented and yet powerful.

Do you think it might be possible to integrate the PostgreSQL
replication with pgbouncer for a transparent failover? The idea
would be to pause the clients in pgbouncer while moving the
replication master so no queries would fail.

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

2017-09-11 Thread Valentin Vidic
On Mon, Sep 11, 2017 at 04:18:08PM +0200, Klaus Wenninger wrote:
> Just for my understanding: You are using watchdog-handling in corosync?

The corosync package in Debian gets built with --enable-watchdog, so by
default it takes /dev/watchdog during runtime.  I don't think the SUSE
or RedHat packages get built with --enable-watchdog, so this behavior
is disabled there.
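
You can check who is holding the device on a running node with
something like:

  lsof /dev/watchdog
  # or
  fuser -v /dev/watchdog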

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

2017-09-10 Thread Valentin Vidic
On Sun, Sep 10, 2017 at 08:27:47AM +0200, Ferenc Wágner wrote:
> Confirmed: setting watchdog_device: off cluster wide got rid of the
> above warnings.

Interesting, what brand or version of IPMI has this problem?

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] XenServer guest and host watchdog

2017-09-09 Thread Valentin Vidic
On Fri, Sep 08, 2017 at 09:39:26PM +0100, Andrew Cooper wrote:
> Yes.  The internal mechanism of the host watchdog is to use one
> performance counter to count retired instructions and generate an NMI
> roughly once every half second (give or take C and P states).
> 
> Separately, there is a one second timer (the same framework as all other
> timers in Xen, including the guest watchdog), which triggers a softirq
> (lower priority, runs on the return-to-guest path), which increments a
> local variable.  If the NMI handler doesn't observe this local variable
> incrementing in the timeout period, Xen crash the entire system.

Thanks for the explanation.  And in addition to the software guest
and host watchdogs, an external watchdog device like ipmi_watchdog
or iTCO_wdt can be used inside dom0.
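
Just as a sketch, the module name depends on the actual hardware:

  # load a hardware watchdog driver in dom0
  modprobe ipmi_watchdog
  # or
  modprobe iTCO_wdt

  # the driver should then provide /dev/watchdog for the cluster to use
  ls -l /dev/watchdog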

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] XenServer guest and host watchdog

2017-09-08 Thread Valentin Vidic
On Fri, Sep 08, 2017 at 12:57:12PM +, Mark Syms wrote:
> As we discussed regarding the handling of watchdog in XenServer, both
> guest and host, I've had a discussion with our subject matter expert
> (Andrew, cc'd) on this topic. The guest watchdogs are handled by a
> hardware timer in the hypervisor but if the timers themselves are not
> serviced within 5 seconds the host watchdog will fire and pull the
> host down.

I presume the host watchdog is the NMI watchdog described in the
Xen Hypervisor Command Line Options?

watchdog = force | <boolean>  (Default: false)
Run an NMI watchdog on each processor. If a processor is stuck for
longer than the watchdog_timeout, a panic occurs. When force is
specified, in addition to running an NMI watchdog on each processor,
unknown NMIs will still be processed.

watchdog_timeout = <integer>  (Default: 5)
Set the NMI watchdog timeout in seconds. Specifying 0 will turn off the
watchdog.

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Valentin Vidic
On Mon, Jul 24, 2017 at 10:38:40AM -0500, Ken Gaillot wrote:
> Standby is not necessary, it's just a cautious step that allows the
> admin to verify that all resources moved off correctly. The restart that
> yum does should be sufficient for pacemaker to move everything.
> 
> A restart shouldn't lead to fencing in any case where something's not
> going seriously wrong. I'm not familiar with the "kernel is using it"
> message, I haven't run into that before.

Right, the pacemaker upgrade might not be the biggest problem.  I've seen
other package upgrades cause RA monitors to return results like
$OCF_NOT_RUNNING or $OCF_ERR_INSTALLED.  This of course causes the
cluster to react, so I prefer the node standby option :)

In this case pacemaker was trying to stop the resources, the stop
action failed, and the upgrading node was killed off by the second
node trying to clean up the mess.  The resources should have come up
on the second node after that.
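
The careful upgrade path I had in mind is roughly this (crm shell
syntax, pcs has equivalent standby/unstandby commands):

  crm node standby     # move resources off this node first
  yum update           # or apt-get, depending on the distribution
  crm node online      # let resources move back if needed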

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Valentin Vidic
On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote:
> Lsof/fuser show the PID of the process holding FS open as "kernel".

That could be the NFS server running in the kernel.

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-23 Thread Valentin Vidic
On Sun, Jul 23, 2017 at 07:27:03AM -0500, Dmitri Maziuk wrote:
> So yesterday I ran yum update that puled in the new pacemaker and tried to
> restart it. The node went into its usual "can't unmount drbd because kernel
> is using it" and got stonith'ed in the middle of yum transaction. The end
> result: DRBD reports split brain, HA daemons don't start on boot, RPM
> database is FUBAR. I've had enough. I'm rebuilding this cluster as centos 6
> + heartbeat R1.

It seems you did not put the node into standby before the upgrade as it
still had resources running.  What was the old/new pacemaker version there?

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Coming in Pacemaker 1.1.17: container bundles

2017-07-01 Thread Valentin Vidic
On Fri, Jun 30, 2017 at 12:46:29PM -0500, Ken Gaillot wrote:
> The challenge is that some properties are docker-specific and other
> container engines will have their own specific properties.
> 
> We decided to go with a tag for each supported engine -- so if we add
> support for rkt, we'll add a <rkt> tag with whatever properties it
> needs. Then a <bundle> would need to contain either a <docker> tag or a
> <rkt> tag.
> 
> We did consider a generic alternative like:
> 
>   
>  
>  
>  ...
>  
>  ...
>   
> 
> But it was decided that using engine-specific tags would allow for
> schema enforcement, and would be more readable.
> 
> The <network> and <storage> tags were kept under <bundle> because we
> figured those are essential to the concept of a bundle, and any engine
> should support some way of mapping those.

Thanks for the explanation, it makes sense :)

Now I have a working rkt resource agent and would like to test it.
Can you share the pcmk:httpd image mentioned in the docker example?
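
Just for illustration, the rkt bundle I have in mind would mirror the
docker example, something like this (attribute names are my assumption):

  <bundle id="httpd-bundle">
    <rkt image="pcmk:httpd" replicas="3"/>
    <network ip-range-start="192.168.122.131"/>
    <primitive class="ocf" id="httpd" provider="heartbeat" type="apache"/>
  </bundle>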

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Coming in Pacemaker 1.1.17: container bundles

2017-06-30 Thread Valentin Vidic
On Fri, Mar 31, 2017 at 05:43:02PM -0500, Ken Gaillot wrote:
> Here's an example of the CIB XML syntax (higher-level tools will likely
> provide a more convenient interface):
> 
>  
> 
>   

Would it be possible to make this a bit more generic like:

  

so we have support for other container engines like rkt?

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] simple active/active router using pacemaker+corosync

2017-01-26 Thread Valentin Vidic
On Thu, Jan 26, 2017 at 12:10:24PM +0100, Arturo Borrero Gonzalez wrote:
> I have a rather simple 2 nodes active/active router using pacemaker+corosync.
> 
> Why active-active? Well, one node holds the virtual IPv4 resources and
> the other node holds the virtual IPv6 resources.
> On failover, both nodes are able to run all the virtual IPv4/IPv6 addresses.
> 
> We have about 30 resources configured, and more will be added in the future.

You may need to check some pacemaker limits for this number of resources:

* batch-limit (30)
The number of jobs that the Transition Engine (TE) is allowed
to execute in parallel. The TE is the logic in pacemaker’s CRMd that executes
the actions determined by the Policy Engine (PE). The "correct" value will
depend on the speed and load of your network and cluster nodes.

* migration-limit (-1)
The number of migration jobs that the TE is allowed to
execute in parallel on a node. A value of -1 means unlimited. 
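
A sketch of how these could be raised -- the values are just examples,
not recommendations:

  crm configure property batch-limit=100
  crm configure property migration-limit=10

  # or with pcs:
  pcs property set batch-limit=100
  pcs property set migration-limit=10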

> The problems/questions are:
> 
>  * The IPv6addr resource agent is so slow. I guess that's because of
> the additional checks (pings). I had to switch to IPaddr2 for the
> virtual IPv6 resources as well, which improves the failover times a
> bit. Is this expected? Any hint here?

Can you check how slow it is?  It should take 5 seconds to send
advertisements, so the whole move takes 6-7 seconds, which seems
reasonable to me.  The address should be functional most of that time.

>  * In order to ease management, I created 2 groups, one for all the
> IPv4 addresses and other for all the IPv6 addresses. This way, I can
> perform operations (such as movements, start/stop) for all the
> resources in one go. This has a known drawback: in a group, the
> resources are managed in chain by the order of the group. On failover,
> this really hurts the movement time, since resources aren't moved in
> parallel but sequentially. Any hint here?
> 
> I would like to have a simple way of managing lot of resources in one
> go, but without the ordering drawbacks of a group.

Guess you could create a Dummy resource and add INFINITY colocation
constraints for the IPs so they follow the Dummy as it moves between the
nodes :)
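
In crm shell syntax that would be something like this (resource names
are made up):

  primitive ipv4-anchor ocf:pacemaker:Dummy
  colocation ip1-with-anchor inf: HA_IP-1 ipv4-anchor
  colocation ip2-with-anchor inf: HA_IP-2 ipv4-anchor

Moving ipv4-anchor then takes all the colocated IPs with it.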

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] eventmachine gem in pcsd

2016-06-30 Thread Valentin Vidic
On Thu, Jun 30, 2016 at 01:27:25PM +0200, Tomas Jelinek wrote:
> It seems eventmachine can be safely dropped as all tests passed without it.

Great, thanks for confirming.

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pcs testsuite status

2016-06-29 Thread Valentin Vidic
On Wed, Jun 29, 2016 at 10:31:42AM +0200, Tomas Jelinek wrote:
> This should be replaceable by any agent which does not provide unfencing,
> i.e. it does not have on_target="1" automatic="1" attributes in <action name="on" />. You may need to experiment with few agents to find one which
> works.

Just changed fence_xvm to fence_dummy and the tests pass with that.

-- 
Valentin

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] pcs testsuite status

2016-06-28 Thread Valentin Vidic
I'm trying to run the pcs tests on Debian unstable, but there
are some strange failures, like diffs failing due to an
additional space at the end of a line, or just the message
"Error: cannot load cluster status, xml does not conform to the schema".

Any idea what could be the issue here?  I assume the tests
work on RHEL7 so the problem might be with the package
versions I'm using:

pacemaker: 1.1.15~rc3-2
corosync: 2.3.6-1
pcs: 0.9.152-1

FAIL: testNodeStandby (pcs.test.test_cluster.ClusterTest)
--
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_cluster.py", line 45, in 
testNodeStandby
ac(output, "")
  File "/usr/lib/python2.7/dist-packages/pcs/test/tools/misc.py", line 31, in ac
"strings not equal:\n{0}".format(prepare_diff(b, a))
AssertionError: strings not equal:
+ Error: cannot load cluster status, xml does not conform to the schema


==
FAIL: testFenceLevels (pcs.test.test_stonith.StonithTest)
--
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_stonith.py", line 374, 
in testFenceLevels
assert returnVal == 0
AssertionError

==
FAIL: testStonithCreation (pcs.test.test_stonith.StonithTest)
--
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_stonith.py", line 161, 
in testStonithCreation
""")
  File "/usr/lib/python2.7/dist-packages/pcs/test/tools/misc.py", line 31, in ac
"strings not equal:\n{0}".format(prepare_diff(b, a))
AssertionError: strings not equal:
+ Error: cannot load cluster status, xml does not conform to the schema
  Cluster Name: test99
- Corosync Nodes:
-  rh7-1 rh7-2
- Pacemaker Nodes:
- 
- Resources:
- 
- Stonith Devices:
-  Resource: test1 (class=stonith type=fence_noxist)
-   Operations: monitor interval=60s (test1-monitor-interval-60s)
-  Resource: test2 (class=stonith type=fence_ilo)
-   Operations: monitor interval=60s (test2-monitor-interval-60s)
-  Resource: test3 (class=stonith type=fence_ilo)
-   Attributes: ipaddr=test login=testA
-   Operations: monitor interval=60s (test3-monitor-interval-60s)
-  Resource: test-fencing (class=stonith type=fence_apc)
-   Attributes: pcmk_host_list="rhel7-node1
-   Operations: monitor interval=61s (test-fencing-monitor-interval-61s)
- Fencing Levels:
- 
- Location Constraints:
- Ordering Constraints:
- Colocation Constraints:
- Ticket Constraints:
- 
- Resources Defaults:
-  No defaults set
- Operations Defaults:
-  No defaults set
- 
- Cluster Properties:
- 
- Quorum:
-   Options:


==
FAIL: testStonithDeleteRemovesLevel (pcs.test.test_stonith.StonithTest)
--
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_stonith.py", line 665, 
in testStonithDeleteRemovesLevel
self.assertEqual(returnVal, 0)
AssertionError: 1 != 0

==
FAIL: test_stonith_create_provides_unfencing (pcs.test.test_stonith.StonithTest)
--
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_stonith.py", line 193, 
in test_stonith_create_provides_unfencing
ac(output, "")
  File "/usr/lib/python2.7/dist-packages/pcs/test/tools/misc.py", line 31, in ac
"strings not equal:\n{0}".format(prepare_diff(b, a))
AssertionError: strings not equal:
+ Error: Agent 'fence_xvm' not found, use --force to override


==
FAIL: test_node_maintenance (pcs.test.test_node.NodeTest)
--
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_node.py", line 31, in 
test_node_maintenance
ac("", output)
  File "/usr/lib/python2.7/dist-packages/pcs/test/tools/misc.py", line 31, in ac
"strings not equal:\n{0}".format(prepare_diff(b, a))
AssertionError: strings not equal:
- Error: cannot load cluster status, xml does not conform to the schema


==
FAIL: test_node_standby (pcs.test.test_node.NodeTest)
--
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_node.py", line 145, in 
test_node_standby
ac(output, "")
  File "/usr/lib/python2.7/dist-packages/pcs/test/tools/misc.py", line 31, in ac
"strings not 

Re: [ClusterLabs] dlm_controld 4.0.4 exits when crmd is fencing another node

2016-04-26 Thread Valentin Vidic
On Fri, Jan 22, 2016 at 07:57:52PM +0300, Vladislav Bogdanov wrote:
> Tried reverting this one and a51b2bb ("If an error occurs unlink the 
> lock file and exit with status 1") one-by-one and both together, the 
> same result.
> 
> So problem seems to be somewhere deeper.

I've got the same fencing problem with dlm-4.0.4 on Debian.  Looking
at an strace of the dlm_controld process, I can see it exits right
after returning from the poll call due to a SIGCHLD signal:

wait4(2279, 0x7ffd2f468afc, WNOHANG, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, 
{fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=14, 
events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, 
events=POLLIN}], 10, 1000) = 0 (Timeout)
wait4(2279, 0x7ffd2f468afc, WNOHANG, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, 
{fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=14, 
events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, 
events=POLLIN}], 10, 1000) = 0 (Timeout)
wait4(2279, 0x7ffd2f468afc, WNOHANG, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, 
{fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=14, 
events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, 
events=POLLIN}], 10, 1000) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2279, si_uid=0, 
si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn()  = -1 EINTR (Interrupted system call)
close(11)   = 0
sendto(10, "\240", 1, MSG_NOSIGNAL, NULL, 0) = 1
sendto(17, "\20", 1, MSG_NOSIGNAL, NULL, 0) = 1
poll([{fd=17, events=POLLIN}], 1, 0)= 0 (Timeout)
shutdown(17, SHUT_RDWR) = 0
close(17)   = 0
munmap(0x7f5f45c26000, 2105344) = 0
munmap(0x7f5f4aeea000, 8248)= 0
munmap(0x7f5f45a24000, 2105344) = 0
munmap(0x7f5f4aee7000, 8248)= 0
munmap(0x7f5f45822000, 2105344) = 0

and in fact there is a recent change in 4.0.4 modifying that part
of the code:

  If an error occurs unlink the lock file and exit with status 1
  
https://git.fedorahosted.org/cgit/dlm.git/commit/?id=a51b2bbe413222829778698e62af88a73ebec233

The bug is caused by the missing braces in the expanded if
statement: without them only the rv = 0 assignment is guarded by the
condition, so the goto out runs unconditionally whenever poll() is
interrupted by a signal, and dlm_controld exits even though
daemon_quit was never set.

Do you think we can get a new version out with this patch, as the
fencing in 4.0.4 does not work properly due to this issue?

-- 
Valentin
Index: dlm-4.0.4/dlm_controld/main.c
===
--- dlm-4.0.4.orig/dlm_controld/main.c
+++ dlm-4.0.4/dlm_controld/main.c
@@ -1028,9 +1028,10 @@ static int loop(void)
 	for (;;) {
 		rv = poll(pollfd, client_maxi + 1, poll_timeout);
 		if (rv == -1 && errno == EINTR) {
-			if (daemon_quit && list_empty())
+			if (daemon_quit && list_empty()) {
 rv = 0;
 goto out;
+			}
 			if (daemon_quit) {
 log_error("shutdown ignored, active lockspaces");
 daemon_quit = 0;
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org