Re: [ClusterLabs] Corosync crash
On Tue, May 07, 2019 at 09:59:03AM +0300, Klecho wrote:
> During the weekend my corosync daemon suddenly died without anything
> in the logs, except this:
>
> May 5 20:39:16 ZZZ kernel: [1605277.136049] traps: corosync[2811] trap
> invalid opcode ip:5635c376f2eb sp:7ffc3e109950 error:0 in
> corosync[5635c3745000+47000]
>
> The version is corosync 2.4.4-3 amd64 standard Debian stretch

Don't recall seeing crashes like that with the stretch release, any
chance you can run a memtest on that machine?

--
Valentin
Re: [ClusterLabs] Antw: Re: Issue with DB2 HADR cluster
On Wed, Apr 03, 2019 at 09:13:58AM +0200, Ulrich Windl wrote:
> I'm surprised: Once sbd writes the fence command, it usually takes
> less than 3 seconds until the victim is dead. If you power off a
> server, the PDU still may have one or two seconds "power reserve", so
> the host may not be down immediately. Besides that, power-cycles are
> additional stress for the hardware...
>
> So maybe you want to explain why and how much faster IPMI and PDU
> fencing are.

SBD is slow for me too. Since it doesn't have a way to confirm the
kill, it needs to wait for various timeouts, and these can be quite
high. For example, the IBM storage timeouts require this setup:

  Timeout (watchdog) : 130
  Timeout (msgwait)  : 270

On the same cluster an IPMI fence executes in a second or two, but
requires network connectivity.

--
Valentin
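[For reference, a minimal sketch of how such timeouts get written to the
SBD device; the disk path is a placeholder, and -1/-4 only take effect
at create time:

  # initialize the SBD slots with a 130s watchdog and 270s msgwait timeout
  sbd -d /dev/disk/by-id/scsi-SHARED-LUN -1 130 -4 270 create
  # print what is actually stored on the device (the "Timeout" lines above)
  sbd -d /dev/disk/by-id/scsi-SHARED-LUN dump

Pacemaker's stonith-timeout then has to be larger than msgwait, which is
where the long fencing times come from.]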
Re: [ClusterLabs] Question on sharing data with DRDB
On Wed, Mar 20, 2019 at 07:31:02PM +0100, Valentin Vidic wrote:
> Right, but I'm not sure how this would help in the above situation
> unless DRBD can undo the local write that did not succeed on the
> peer?

Ah, it seems the activity log handles the undo by storing the location
of these dirty blocks (not replicated properly):

  https://docs.linbit.com/docs/users-guide-8.4/#s-activity-log

So on resync these blocks would get copied from the new primary.

--
Valentin
Re: [ClusterLabs] Question on sharing data with DRDB
On Wed, Mar 20, 2019 at 02:01:07PM -0400, Digimer wrote:
> On 2019-03-20 2:00 p.m., Valentin Vidic wrote:
> > On Wed, Mar 20, 2019 at 01:47:56PM -0400, Digimer wrote:
> >> Not when DRBD is configured correctly. You set 'fencing
> >> resource-and-stonith;' and set the appropriate fence handler. This
> >> tells DRBD to not proceed with a write while a node is in an unknown
> >> state (which happens when the node stops responding and is cleared
> >> on successful fence).
> >
> > The situation I had in mind is this: node1 is Primary and sends a
> > write to the local disk and over the DRBD link. The network fails so
> > the write is successful only on the local disk. node1 is then fenced
> > and node2 takes over, but the disks have now diverged.
>
> This is handled by using Protocol C in DRBD. That tells DRBD to not
> consider a write complete until it has hit persistent storage on both
> nodes.

Right, but I'm not sure how this would help in the above situation
unless DRBD can undo the local write that did not succeed on the peer?

--
Valentin
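[Putting the pieces under discussion together, the drbd.conf bits would
look roughly like this — a sketch for DRBD 8.4, the resource name is a
placeholder:

  resource r0 {
      net {
          protocol C;   # a write completes only once both disks have it
      }
      disk {
          fencing resource-and-stonith;   # freeze I/O and fence on disconnect
      }
      handlers {
          fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
      }
      ...
  }

With resource-and-stonith set, DRBD suspends I/O while the fence handler
runs, so the window for divergent writes is closed by the freeze rather
than by undoing anything.]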
Re: [ClusterLabs] Question on sharing data with DRDB
On Wed, Mar 20, 2019 at 01:47:56PM -0400, Digimer wrote:
> Not when DRBD is configured correctly. You set 'fencing
> resource-and-stonith;' and set the appropriate fence handler. This
> tells DRBD to not proceed with a write while a node is in an unknown
> state (which happens when the node stops responding and is cleared on
> successful fence).

The situation I had in mind is this: node1 is Primary and sends a write
to the local disk and over the DRBD link. The network fails so the
write is successful only on the local disk. node1 is then fenced and
node2 takes over, but the disks have now diverged.

--
Valentin
Re: [ClusterLabs] Question on sharing data with DRDB
On Wed, Mar 20, 2019 at 01:44:06PM -0400, Digimer wrote:
> GFS2 notifies the peers of disk changes, and DRBD handles actually
> copying the changes to the peer.
>
> Think of DRBD, in this context, as being mdadm RAID, like how writing
> to /dev/md0 is handled behind the scenes to write to both /dev/sda3 +
> /dev/sdb3. DRBD is the same, any writes to /dev/drbd0 are written to
> both node1:/dev/sda3 + node2:/dev/sda3.
>
> So DRBD handles replication, and GFS2 handles coordination.

Yes, I was thinking more of GFS2 in a shared storage setup: how much
overhead is there if the cluster nodes all write to different files,
like VM images?

--
Valentin
Re: [ClusterLabs] Question on sharing data with DRDB
On Wed, Mar 20, 2019 at 01:34:52PM -0400, Digimer wrote:
> Depending on your fail-over tolerances, I might add NFS to the mix and
> have the NFS server run on one node or the other, exporting your ext4
> FS that sits on DRBD in single-primary mode.
>
> The failover (if the NFS host died) would look like this:
>
> 1. Lost node is fenced.
> 2. DRBD is promoted from Secondary to Primary.
> 3. ext4 FS is mounted.
> 4. Virtual IP (used for NFS) is brought up.
> 5. NFS starts.
>
> Startup and graceful migration would be the same, minus the fence.

Would it be possible for DRBD to go into split-brain if the lost node
manages to write something to the local DRBD disk before it gets
fenced?

--
Valentin
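[For anyone wanting to try it, the stack Digimer describes maps onto a
pacemaker configuration along these lines — a crm shell sketch where the
resource names, device, mountpoint, and IP are all hypothetical:

  primitive p_drbd ocf:linbit:drbd \
      params drbd_resource=r0 \
      op monitor interval=15 role=Master \
      op monitor interval=30 role=Slave
  ms ms_drbd p_drbd meta master-max=1 clone-max=2 notify=true
  primitive p_fs Filesystem \
      params device=/dev/drbd0 directory=/srv/nfs fstype=ext4
  primitive p_vip IPaddr2 params ip=192.0.2.100 cidr_netmask=24
  primitive p_nfs nfsserver params nfs_shared_infodir=/srv/nfs/info
  group g_nfs p_fs p_vip p_nfs
  colocation col_nfs inf: g_nfs ms_drbd:Master
  order ord_nfs inf: ms_drbd:promote g_nfs:start

The group order (filesystem, VIP, NFS) matches steps 3-5 of the failover
sequence above, and the order constraint covers step 2.]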
Re: [ClusterLabs] Question on sharing data with DRDB
On Wed, Mar 20, 2019 at 12:37:21PM -0400, Digimer wrote:
> Cluster filesystems are amazing if you need them, and to be avoided if
> at all possible. The overhead from the cluster locking hurts
> performance quite a lot, and adds a non-trivial layer of complexity.
>
> I say this as someone who has used dual-primary DRBD with GFS2 for
> many years.

If the GFS2 holds qcow2 images, does node1 need to synchronize writes
to vm1.qcow2 with node2 writing to vm2.qcow2?

--
Valentin
Re: [ClusterLabs] Question on sharing data with DRDB
On Wed, Mar 20, 2019 at 09:36:58AM -0600, JCA wrote:
> # pcs -f fs_cfg resource create TestFS Filesystem device="/dev/drbd1" \
>     directory="/tmp/Testing" fstype="ext4"

ext4 can only be mounted on one node at a time. If you need to access
files on both nodes at the same time, then a cluster filesystem should
be used (GFS2, OCFS2).

--
Valentin
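[The cluster-filesystem variant of that command would look something
like the sketch below. Note this is only a sketch: GFS2 additionally
needs dlm running on all nodes and the DRBD device in dual-primary
mode, which the original configuration does not have:

  pcs -f fs_cfg resource create TestFS Filesystem device="/dev/drbd1" \
      directory="/tmp/Testing" fstype="gfs2" clone interleave=true

The trailing "clone" makes pacemaker mount the filesystem on every
node instead of just one.]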
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sat, Feb 16, 2019 at 10:23:17PM +0000, Eric Robinson wrote:
> I'm looking through the docs but I don't see how to set the on-fail
> value for a resource.

It is not set on the resource itself but on each of the actions
(monitor, start, stop).

--
Valentin
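[In pcs syntax that would look something like this sketch — the
resource name is taken from the thread and the values are just
examples:

  pcs resource update p_mysql_002 op monitor interval=30s on-fail=block

Here on-fail is a property of the monitor operation, not of the
resource as a whole.]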
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sat, Feb 16, 2019 at 09:33:42PM +0000, Eric Robinson wrote:
> I just noticed that. I also noticed that the lsb init script has a
> hard-coded stop timeout of 30 seconds. So if the init script waits
> longer than the cluster resource timeout of 15s, that would cause the

Yes, you should use higher timeouts in pacemaker (45s for example).

> resource to fail. However, I don't want cluster failover to be
> triggered by the failure of one of the MySQL resources. I only want
> cluster failover to occur if the filesystem or drbd resources fail, or
> if the cluster messaging layer detects a complete node failure. Is
> there a way to tell PaceMaker not to trigger cluster failover if any
> of the p_mysql resources fail?

You can try playing with the on-fail option, but I'm not sure how
reliably this whole setup will work without some form of
fencing/stonith.

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html

--
Valentin
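[Raising the stop timeout above the init script's hard-coded 30 seconds
would look something like this sketch, with the resource name from the
thread:

  pcs resource update p_mysql_002 op stop interval=0s timeout=45s

The point is that pacemaker's operation timeout must be longer than the
worst case the init script itself can take.]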
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sat, Feb 16, 2019 at 09:03:43PM +0000, Eric Robinson wrote:
> Here are the relevant corosync logs.
>
> It appears that the stop action for resource p_mysql_002 failed, and
> that caused a cascading series of service changes. However, I don't
> understand why, since no other resources are dependent on p_mysql_002.

The stop failed because of a timeout (15s), so you can try to update
that value:

  Result of stop operation for p_mysql_002 on 001db01a: Timed Out |
    call=1094 key=p_mysql_002_stop_0 timeout=15000ms

After the stop failed it should have fenced that node, but you don't
have fencing configured, so it tries to move mysql_002 and all the
other resources related to it (vip, fs, drbd) to the other node. Since
the other mysql resources depend on the same (vip, fs, drbd), they need
to be stopped first.

--
Valentin
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sat, Feb 16, 2019 at 08:50:57PM +0000, Eric Robinson wrote:
> Which logs? You mean /var/log/cluster/corosync.log?

On the DC node pacemaker will be logging the actions it is trying to
run (start or stop some resources).

> But even if the stop action is resulting in an error, why would the
> cluster also try to stop the other services which are not dependent?

When the resource has failed, pacemaker might still try to run stop for
that resource. If the lsb script is not correct, that might also stop
other mysql resources. But this should all be reported in the pacemaker
log.

--
Valentin
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sat, Feb 16, 2019 at 08:34:21PM +0000, Eric Robinson wrote:
> Why is it that when one of the resources that start with p_mysql_*
> goes into a FAILED state, all the other MySQL services also stop?

Perhaps stop is not working correctly for these lsb services, so for
example stopping lsb:mysql_004 also stops the other lsb:mysql_nnn. You
would need to send the logs from the event to confirm this.

--
Valentin
Re: [ClusterLabs] Announcing hawk-apiserver, now in ClusterLabs
On Tue, Feb 12, 2019 at 08:00:38PM +0100, Kristoffer Grönlund wrote:
> One final note: hawk-apiserver uses a project called go-pacemaker
> located at https://github.com/krig/go-pacemaker. I intend to transfer
> this to ClusterLabs as well. go-pacemaker is still somewhat rough
> around the edges, and our plan is to work on the C API of pacemaker to
> make using and exposing it via Go easier, as well as moving
> functionality from crm_mon into the C API so that status information
> can be made available in a more convenient format via the API as well.

So no Lisp here? Just kidding, great work and looking forward to trying
it out :)

--
Valentin
Re: [ClusterLabs] Trying to Understanding crm-fence-peer.sh
On Wed, Jan 16, 2019 at 04:20:03PM +0100, Valentin Vidic wrote:
> I think drbd always calls crm-fence-peer.sh when it becomes a
> disconnected primary. In this case storage1 has closed the DRBD
> connection and storage2 has become a disconnected primary.
>
> Maybe the problem is the order in which the services are stopped
> during reboot. It would seem that drbd is shut down before pacemaker.
> You can try to run manually:
>
>   pacemaker stop
>   corosync stop
>   drbd stop
>
> and see what happens in this case.

Some more info here:

  https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_drbd_fencing.html

So storage2 does not know why the other end disappeared and tries to
use pacemaker to prevent storage1 from ever becoming a primary. Only
when it comes back online and gets in sync is it allowed to start again
as a pacemaker resource, by a second script:

  after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";

--
Valentin
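[So the complete handler pair in drbd.conf would be roughly the
following sketch; the script paths are the ones shipped by drbd-utils:

  handlers {
      # called when DRBD becomes a disconnected primary: adds a
      # constraint in pacemaker that keeps the peer from being promoted
      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
      # called once the peer is back in sync: removes that constraint
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }

Together they implement the fence/unfence cycle described above.]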
Re: [ClusterLabs] Trying to Understanding crm-fence-peer.sh
On Wed, Jan 16, 2019 at 09:03:21AM -0600, Bryan K. Walton wrote:
> The exit code 4 would seem to suggest that storage1 should be fenced.
> But the switch ports connected to storage1 are still enabled.
>
> Am I misreading the logs here? This is a clean reboot, maybe fencing
> isn't supposed to happen in this situation? But the logs seem to
> suggest otherwise.

I think drbd always calls crm-fence-peer.sh when it becomes a
disconnected primary. In this case storage1 has closed the DRBD
connection and storage2 has become a disconnected primary.

Maybe the problem is the order in which the services are stopped during
reboot. It would seem that drbd is shut down before pacemaker. You can
try to run manually:

  pacemaker stop
  corosync stop
  drbd stop

and see what happens in this case.

--
Valentin
Re: [ClusterLabs] Unexpected resource restart
On Wed, Jan 16, 2019 at 12:41:11PM +0100, Valentin Vidic wrote:
> This is what pacemaker says about the resource restarts:
>
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start dlm:1 ( node2 )
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start lockd:1 ( node2 )
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Restart gfs2-lvm:0 ( node1 ) due to required storage-clone running
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Restart gfs2-fs:0 ( node1 ) due to required gfs2-lvm:0 start
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start gfs2-lvm:1 ( node2 )
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start gfs2-fs:1 ( node2 )
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Restart ocfs2-lvm:0 ( node1 ) due to required storage-clone running
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Restart ocfs2-fs:0 ( node1 ) due to required ocfs2-lvm:0 start
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start ocfs2-lvm:1 ( node2 )
> Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start ocfs2-fs:1 ( node2 )

It seems interleave was required on the gfs2 and ocfs2 clones:

  interleave (default: false)
    If this clone depends on another clone via an ordering constraint,
    is it allowed to start after the local instance of the other clone
    starts, rather than wait for all instances of the other clone to
    start?

Now it behaves as expected when node2 is set online:

  Jan 16 12:35:33 node1 pacemaker-schedulerd[564]: notice: * Start dlm:1 ( node2 )
  Jan 16 12:35:33 node1 pacemaker-schedulerd[564]: notice: * Start lockd:1 ( node2 )
  Jan 16 12:35:33 node1 pacemaker-schedulerd[564]: notice: * Start gfs2-lvm:1 ( node2 )
  Jan 16 12:35:33 node1 pacemaker-schedulerd[564]: notice: * Start gfs2-fs:1 ( node2 )
  Jan 16 12:35:33 node1 pacemaker-schedulerd[564]: notice: * Start ocfs2-lvm:1 ( node2 )
  Jan 16 12:35:33 node1 pacemaker-schedulerd[564]: notice: * Start ocfs2-fs:1 ( node2 )

  Clone: gfs2-clone
    Meta Attrs: interleave=true target-role=Started
    Group: gfs2
      Resource: gfs2-lvm (class=ocf provider=heartbeat type=LVM-activate)
        Attributes: activation_mode=shared vg_access_mode=lvmlockd vgname=vgshared lvname=gfs2
      Resource: gfs2-fs (class=ocf provider=heartbeat type=Filesystem)
        Attributes: directory=/srv/gfs2 fstype=gfs2 device=/dev/vgshared/gfs2

--
Valentin
Re: [ClusterLabs] Unexpected resource restart
On Wed, Jan 16, 2019 at 12:16:04PM +0000, Andrew Price wrote:
> The only thing that stands out to me with this config is the lack of
> ordering constraint between dlm and lvmlockd. Not sure if that's the
> issue though.

They are both in the storage group, so the order should be dlm then
lockd?

  Clone: storage-clone
    Meta Attrs: interleave=true target-role=Started
    Group: storage
      Resource: dlm (class=ocf provider=pacemaker type=controld)
      Resource: lockd (class=ocf provider=heartbeat type=lvmlockd)

--
Valentin
Re: [ClusterLabs] Unexpected resource restart
On Wed, Jan 16, 2019 at 12:28:59PM +0100, Valentin Vidic wrote:
> When node2 is set to standby, resources stop running there. However
> when node2 is brought back online, it causes the resources on node1 to
> stop and then start again, which is a bit unexpected?
>
> Maybe the dependency between the common storage group and the upper
> gfs2/ocfs2 groups could be written in some other way to prevent this
> resource restart?

This is what pacemaker says about the resource restarts:

  Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start dlm:1 ( node2 )
  Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start lockd:1 ( node2 )
  Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Restart gfs2-lvm:0 ( node1 ) due to required storage-clone running
  Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Restart gfs2-fs:0 ( node1 ) due to required gfs2-lvm:0 start
  Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start gfs2-lvm:1 ( node2 )
  Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start gfs2-fs:1 ( node2 )
  Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Restart ocfs2-lvm:0 ( node1 ) due to required storage-clone running
  Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Restart ocfs2-fs:0 ( node1 ) due to required ocfs2-lvm:0 start
  Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start ocfs2-lvm:1 ( node2 )
  Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start ocfs2-fs:1 ( node2 )

--
Valentin
[ClusterLabs] Unexpected resource restart
Hi all,

I'm testing the following configuration with two nodes:

  Clone: storage-clone
    Meta Attrs: interleave=true target-role=Started
    Group: storage
      Resource: dlm (class=ocf provider=pacemaker type=controld)
      Resource: lockd (class=ocf provider=heartbeat type=lvmlockd)
  Clone: gfs2-clone
    Group: gfs2
      Resource: gfs2-lvm (class=ocf provider=heartbeat type=LVM-activate)
        Attributes: activation_mode=shared vg_access_mode=lvmlockd vgname=vgshared lvname=gfs2
      Resource: gfs2-fs (class=ocf provider=heartbeat type=Filesystem)
        Attributes: directory=/srv/gfs2 fstype=gfs2 device=/dev/vgshared/gfs2
  Clone: ocfs2-clone
    Group: ocfs2
      Resource: ocfs2-lvm (class=ocf provider=heartbeat type=LVM-activate)
        Attributes: activation_mode=shared vg_access_mode=lvmlockd vgname=vgshared lvname=ocfs2
      Resource: ocfs2-fs (class=ocf provider=heartbeat type=Filesystem)
        Attributes: directory=/srv/ocfs2 fstype=ocfs2 device=/dev/vgshared/ocfs2

  Ordering Constraints:
    storage-clone then gfs2-clone (kind:Mandatory) (id:gfs2_after_storage)
    storage-clone then ocfs2-clone (kind:Mandatory) (id:ocfs2_after_storage)
  Colocation Constraints:
    gfs2-clone with storage-clone (score:INFINITY) (id:gfs2_with_storage)
    ocfs2-clone with storage-clone (score:INFINITY) (id:ocfs2_with_storage)

When node2 is set to standby, resources stop running there. However
when node2 is brought back online, it causes the resources on node1 to
stop and then start again, which is a bit unexpected?

Maybe the dependency between the common storage group and the upper
gfs2/ocfs2 groups could be written in some other way to prevent this
resource restart?

--
Valentin
Re: [ClusterLabs] Status of Pacemaker 2 support in SBD?
On Fri, Jan 11, 2019 at 12:42:02PM +0100, wf...@niif.hu wrote:
> I opened https://github.com/ClusterLabs/sbd/pull/62 with our current
> patches, but I'm just a middle man here. Valentin, do you agree to
> upstream these two remaining patches of yours?

Sure thing, merge anything you can...

--
Valentin
Re: [ClusterLabs] Upgrading from CentOS 6 to CentOS 7
On Thu, Jan 03, 2019 at 04:56:26PM -0600, Ken Gaillot wrote:
> Right -- not only that, but corosync 1 (CentOS 6) and corosync 2
> (CentOS 7) are not compatible for running in the same cluster.

I suppose it is the same situation for upgrading from corosync 2 to
corosync 3?

--
Valentin
Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops
On Tue, Nov 13, 2018 at 11:01:46AM -0600, Ken Gaillot wrote:
> Clone instances have a default stickiness of 1 (instead of the usual
> 0) so that they aren't needlessly shuffled around nodes every
> transition. You can temporarily set an explicit stickiness of 0 to let
> them rebalance, then unset it to go back to the default.

Thanks, this works as expected now:

  clone cip-clone cip \
      meta clone-max=2 clone-node-max=2 globally-unique=true interleave=true \
      resource-stickiness=0 target-role=Started

The clone instance moves when a node is down but also returns when the
node is back online. Do you perhaps know if CLUSTERIP has any special
network requirements to work properly?

--
Valentin
Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops
On Tue, Nov 13, 2018 at 05:04:19PM +0100, Valentin Vidic wrote:
> Also it seems to require multicast, so better check for that too :)

And while the CLUSTERIP resource seems to work for me in a test
cluster, the following clone definition:

  clone cip-clone cip \
      meta clone-max=2 clone-node-max=2 globally-unique=true \
      interleave=true target-role=Started

allows for both clone instances to end up on the same node:

  Clone Set: cip-clone [cip] (unique)
    cip:0 (ocf::heartbeat:IPaddr2): Started sid2
    cip:1 (ocf::heartbeat:IPaddr2): Started sid2

Is there a way to spread the resources other than setting
clone-node-max=1 for a while?

--
Valentin
Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops
On Tue, Nov 13, 2018 at 04:06:34PM +0100, Valentin Vidic wrote:
> Could be some kind of ARP inspection going on in the networking
> equipment, so check switch logs if you have access to that.

Also it seems to require multicast, so better check for that too :)

--
Valentin
Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops
On Tue, Nov 13, 2018 at 09:06:56AM -0500, Daniel Ragle wrote:
> Thanks, finally getting back to this. Putting a tshark on both nodes
> and then restarting the VIP-clone resource shows the pings coming
> through for 12 seconds, always on node2, then stop. I.E., before/after
> those 12 seconds nothing on either node from the server initiating the
> pings.

Could be some kind of ARP inspection going on in the networking
equipment, so check switch logs if you have access to that.

--
Valentin
Re: [ClusterLabs] [ClusterLabs Developers] resource-agents v4.2.0 rc1
On Fri, Oct 19, 2018 at 11:09:34AM +0200, Kristoffer Grönlund wrote:
> I wonder if perhaps there was a configuration change as well, since
> the return code seems to be configuration related. Maybe something
> changed in the build scripts that moved something around? Wild guess,
> but...

Seems to be a problem with the agent script in the end:

  https://github.com/ClusterLabs/resource-agents/pull/1254

--
Valentin
Re: [ClusterLabs] resource-agents v4.2.0 rc1
On Wed, Oct 17, 2018 at 12:03:18PM +0200, Oyvind Albrigtsen wrote:
> - apache: retry PID check.

I noticed that the ocft test started failing for apache in this
version. Not sure if the test is broken or the agent. Can you check if
the test still works for you? Restoring the previous version of the
agent fixes the problem for me.

  # ocft test -v apache
  Initializing 'apache' ...
  Done.
  Starting 'apache' case 0 'check base env':
  ERROR: './apache monitor' failed, the return code is 2.
  Starting 'apache' case 1 'check base env: set non-existing OCF_RESKEY_statusurl':
  ERROR: './apache monitor' failed, the return code is 2.
  Starting 'apache' case 2 'check base env: set non-existing OCF_RESKEY_configfile':
  ERROR: './apache monitor' failed, the return code is 2.
  Starting 'apache' case 3 'normal start':
  ERROR: './apache monitor' failed, the return code is 2.
  Starting 'apache' case 4 'normal stop':
  ERROR: './apache monitor' failed, the return code is 2.
  Starting 'apache' case 5 'double start':
  ERROR: './apache monitor' failed, the return code is 2.
  Starting 'apache' case 6 'double stop':
  ERROR: './apache monitor' failed, the return code is 2.
  Starting 'apache' case 7 'running monitor':
  ERROR: './apache monitor' failed, the return code is 2.
  Starting 'apache' case 8 'not running monitor':
  ERROR: './apache monitor' failed, the return code is 2.
  Starting 'apache' case 9 'unimplemented command':
  ERROR: './apache monitor' failed, the return code is 2.

--
Valentin
Re: [ClusterLabs] LIO iSCSI target fails to start
On Wed, Oct 10, 2018 at 02:36:21PM +0200, Stefan K wrote:
> I think my config is correct, but it still fails with "This Target
> already exists in configFS" while "targetcli ls" shows nothing.

It seems to find something in /sys/kernel/config/target. Maybe it was
set up outside of pacemaker somehow?

--
Valentin
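[If stale state really is lingering in configFS, something like the
following should show and then wipe it. Note that clearconfig destroys
any existing target configuration on the node, so only run it when
nothing is currently exported:

  # look at what is actually registered in configFS
  ls /sys/kernel/config/target/iscsi/
  # drop the whole LIO configuration
  targetcli clearconfig confirm=true

After that, pacemaker should be able to create the target cleanly.]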
Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops
On Thu, Oct 11, 2018 at 01:25:52PM -0400, Daniel Ragle wrote:
> For the 12 second window it *does* work in, it appears as though it
> works only on one of the two servers (and always the same one). My
> twelve seconds of pings runs continuously then stops; while attempts
> to hit the Web server work hit or miss depending on my source port
> (I'm using sourceip-sourceport). I.E., as if anything that would be
> handled by the other server isn't making it through. But after the 12
> seconds neither server responds to the requests against the VIP (but
> they do respond fine to their own static IPs at all times).

Could be that the switch in front of the servers does not like to see
the same MAC on two ports or something like that.

> During the 12 seconds that it works I get these in the logs of the
> server that *is* responding:
>
> Oct 11 12:17:43 node2 kernel: ipt_CLUSTERIP: unknown protocol 1
> Oct 11 12:17:44 node2 kernel: ipt_CLUSTERIP: unknown protocol 1
> Oct 11 12:17:45 node2 kernel: ipt_CLUSTERIP: unknown protocol 1

Protocol 1 once per second should be ICMP ping, so this is just
CLUSTERIP complaining that it can't calculate sourceip-sourceport for
those packets (ICMP has no source port). So maybe try recording the
traffic using tcpdump on both servers and see if any requests are
coming in at all from the network equipment.

--
Valentin
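[Something along these lines would do; the interface name and VIP are
placeholders, and -e prints the link-level headers so you can see
whether the switch delivers frames for the CLUSTERIP multicast MAC to
both nodes:

  tcpdump -eni eth0 arp or icmp or host 192.0.2.100

Running it on both nodes at once makes it obvious which node the
switch is (or is not) forwarding to.]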
Re: [ClusterLabs] [ClusterLabs Developers] fence-agents v4.3.0
On Tue, Oct 09, 2018 at 12:07:38PM +0200, Oyvind Albrigtsen wrote:
> I've created a PR for the library detection and try/except imports:
> https://github.com/ClusterLabs/fence-agents/pull/242

Thanks, I will give it a try right away...

--
Valentin
Re: [ClusterLabs] [ClusterLabs Developers] fence-agents v4.3.0
On Tue, Oct 09, 2018 at 10:55:08AM +0200, Oyvind Albrigtsen wrote:
> It seems like the if-line should be updated to check for those 2
> libraries (from the imports in the agent).

Yes, that might work too. Also, would it be possible to make the
imports in the openstack agent conditional so the metadata works even
when the python modules are not installed? Something like this in
aliyun:

  try:
      from aliyunsdkcore import client

      from aliyunsdkecs.request.v20140526.DescribeInstancesRequest import DescribeInstancesRequest
      from aliyunsdkecs.request.v20140526.StartInstanceRequest import StartInstanceRequest
      from aliyunsdkecs.request.v20140526.StopInstanceRequest import StopInstanceRequest
      from aliyunsdkecs.request.v20140526.RebootInstanceRequest import RebootInstanceRequest
  except ImportError:
      pass

--
Valentin
Re: [ClusterLabs] [ClusterLabs Developers] fence-agents v4.3.0
On Tue, Oct 02, 2018 at 03:13:51PM +0200, Oyvind Albrigtsen wrote:
> ClusterLabs is happy to announce fence-agents v4.3.0.
>
> The source code is available at:
> https://github.com/ClusterLabs/fence-agents/releases/tag/v4.3.0
>
> The most significant enhancements in this release are:
> - new fence agents:
>   - fence_aliyun
>   - fence_openstack

Could not get openstack to build without a patch. Can you check if this
works for you:

--- a/configure.ac
+++ b/configure.ac
@@ -246,8 +246,7 @@
 fi

 if echo "$AGENTS_LIST" | grep -q openstack; then
-AC_PYTHON_MODULE(novaclient)
-AC_PYTHON_MODULE(keystoneclient)
+AC_PYTHON_MODULE(openstackclient)
 if test "x${HAVE_PYMOD_OPENSTACKCLIENT}" != xyes; then
 AGENTS_LIST=$(echo "$AGENTS_LIST" | sed -E "s#openstack/fence_openstack.py( |$)##")
 AC_MSG_WARN("Not building fence_openstack")

--
Valentin
Re: [ClusterLabs] Position of pacemaker in today's HA world
On Fri, Oct 05, 2018 at 11:34:10AM -0500, Ken Gaillot wrote:
> The next big challenge is that high availability is becoming a subset
> of the "orchestration" space in terms of how we fit into IT
> departments. Systemd and Kubernetes are the clear leaders in service
> orchestration today and likely will be for a long while. Other forms
> of orchestration such as Ansible are also highly relevant. Tighter
> integration with these would go a long way toward establishing
> longevity.

Kubernetes seems to be mostly about stateless services, and it kind of
expects you to have some external highly available data store. You can
either buy that data store from your cloud provider or try to build it
yourself, for example using Pacemaker and Galera. Maybe some agents for
registering Pacemaker resources in the Kubernetes etcd or consul would
be useful to connect the two worlds.

On the other side, the big players seem to have settled on using
distributed consensus protocols (Paxos, Raft, Zab) for building
replicated state machines across multiple data centers (or even
continents). It would be an interesting experiment if we could easily
hook up Pacemaker with Zookeeper - electing a DC node would require
just creating a znode. The master node could then store the cluster
configuration as another znode and ask the other nodes to do work
through a job queue. On a global scale this would be quite slow (5-10
changes per second) and node fencing is not available, but for some
types of resources it might be useful.

Some more reading material on the topic for the weekend :)

  https://landing.google.com/sre/book/chapters/managing-critical-state.html
  https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

--
Valentin
Re: [ClusterLabs] About fencing stonith
On Thu, Sep 06, 2018 at 04:47:32PM -0400, Digimer wrote:
> It depends on the hardware you have available. In your case, RPi has
> no IPMI or similar feature, so you'll need something external, like a
> switched PDU. I like the APC AP7900 (or your country's variant), which
> you can often get used for a decent price if this isn't a production
> system.
>
> http://www.apc.com/shop/us/en/products/Rack-PDU-Switched-1U-15A-100-120V-85-15/P-AP7900

RPi should have a reset pin, so it might be possible to cross-connect a
GPIO from one RPi to the reset pin on another and get cheap reset
functionality.

--
Valentin
Re: [ClusterLabs] 2 node cluster dlm/clvm trouble
On Tue, Sep 11, 2018 at 09:31:13AM -0400, Patrick Whitney wrote:
> But, when I invoke the "human" stonith power device (i.e. I turn the
> node off), the other node collapses...
>
> In the logs I supplied, I basically do this:
>
> 1. stonith fence (with fence_scsi)

After fence_scsi finishes, the node should not show any signs of life.
If it continues to work on the network after this point, it can cause
trouble.

> 2. verify UI shows fenced node as stopped
> 3. power off fenced node

Not sure if you use the poweroff command to shut down the node or turn
it off some other way? If you don't have any other fence plugin you can
use, try testing with meatware. Stonith will wait until you manually
confirm with meatclient that the node is down.

--
Valentin
Re: [ClusterLabs] 2 node cluster dlm/clvm trouble
On Tue, Sep 11, 2018 at 04:14:08PM +0300, Vladislav Bogdanov wrote:
> And that is not an easy task sometimes, because the main part of dlm
> runs in the kernel. In some circumstances the only option is to
> forcibly reset the node.

Exactly, killing the power on the node will stop the DLM code running
in the kernel too.

--
Valentin
Re: [ClusterLabs] 2 node cluster dlm/clvm trouble
On Tue, Sep 11, 2018 at 09:13:08AM -0400, Patrick Whitney wrote:
> So when the cluster suggests that DLM is shutdown on coro-test-1:
>
>  Clone Set: dlm-clone [dlm]
>      Started: [ coro-test-2 ]
>      Stopped: [ coro-test-1 ]
>
> ... DLM isn't actually stopped on 1?

If you can connect to the node and see dlm services running, then it is
not stopped:

  20101 dlm_controld
  20245 dlm_scand
  20246 dlm_recv
  20247 dlm_send
  20248 dlm_recoverd

But if you kill the power on the node, then it will be gone for sure :)

--
Valentin
Re: [ClusterLabs] 2 node cluster dlm/clvm trouble
On Tue, Sep 11, 2018 at 09:02:06AM -0400, Patrick Whitney wrote:
> What I'm having trouble understanding is why dlm flattens the
> remaining "running" node when the already fenced node is shut down...
> I'm having trouble understanding how power fencing would cause dlm to
> behave any differently than just shutting down the fenced node.

fence_scsi just kills the storage on the node, but dlm continues to
run, causing problems for the rest of the cluster nodes. So it seems
some other fence agent should be used that would kill dlm too.

--
Valentin
Re: [ClusterLabs] Antw: Re: pcsd processes using 100% CPU
On Thu, May 24, 2018 at 12:16:16AM -0600, Casey & Gina wrote:
> Tried that, it doesn't seem to do anything but prefix the lines with
> the pid:
>
> [pid 24923] sched_yield() = 0
> [pid 24923] sched_yield() = 0
> [pid 24923] sched_yield() = 0

We managed to track this down to a fork bug in some Ruby versions:

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=876377

It is now fixed in Debian stretch/stable, but other distros might still
have this problem.

--
Valentin
Re: [ClusterLabs] Problem with pacemaker init.d script
On Wed, Jul 11, 2018 at 04:31:31PM -0600, Casey & Gina wrote:
> Forgive me for interjecting, but how did you upgrade on Ubuntu? I'm
> frustrated with limitations in 1.1.14 (particularly in PCS so not sure
> if it's relevant), and Ubuntu is ignoring my bug reports, so it would
> be great to upgrade if possible. I'm using Ubuntu 16.04.

pcs is a single package in python and ruby, so it should be possible to
try a newer version and see if it helps.

--
Valentin
Re: [ClusterLabs] Problem with pacemaker init.d script
On Wed, Jul 11, 2018 at 08:01:46PM +0200, Salvatore D'angelo wrote:
> Yes, but doing what you suggested the system finds that sysV is
> installed and tries to leverage the update-rc.d scripts, and the
> failure occurs:
>
> root@pg1:~# systemctl enable corosync
> corosync.service is not a native service, redirecting to systemd-sysv-install
> Executing /lib/systemd/systemd-sysv-install enable corosync
> update-rc.d: error: corosync Default-Start contains no runlevels, aborting.
>
> The only fix I found was to manually edit the header of
> /etc/init.d/corosync, adding the rows mentioned below.
> But this is not a clean approach to solve the issue.
>
> What does pacemaker suggest for newer distributions?

You can try using the init scripts from the Debian/Ubuntu packages for
corosync and pacemaker, as they have the runlevel info included.
Another option is to get the systemd service files working and then
remove the init scripts.

--
Valentin
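[The second option would look something like this — a hedged sketch,
assuming the installed packages actually ship native corosync.service
and pacemaker.service units:

  # remove the sysv scripts so systemd stops redirecting to systemd-sysv-install
  update-rc.d -f corosync remove
  update-rc.d -f pacemaker remove
  rm /etc/init.d/corosync /etc/init.d/pacemaker
  # pick up the native units and enable them
  systemctl daemon-reload
  systemctl enable corosync pacemaker

Once the init scripts are gone, systemctl enable works directly on the
unit files.]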
Re: [ClusterLabs] chap lio-t / iscsitarget disabled - why?
On Tue, Apr 03, 2018 at 04:48:00PM +0200, Stefan Friedel wrote:
> we've a running drbd - iscsi cluster (two nodes Debian stretch,
> pacemaker / corosync, res group w/ ip + iscsitarget/lio-t + iscsiluns
> + lvm etc. on top of drbd etc.). Everything is running fine - but we
> didn't manage to get CHAP to work. targetcli / lio-t always switches
> the authentication off after a migration or restart.
>
> I found the following lines in the iSCSITarget resource file (Debian
> stretch /usr/lib/ocf/resource.d/heartbeat/iSCSITarget, also in
> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/iSCSITarget.in):
>
> [...]
> # TODO: add CHAP authentication support when it gets added back into LIO
> ocf_run targetcli /iscsi/${OCF_RESKEY_iqn}/tpg1/ set attribute authentication=0 || exit $OCF_ERR_GENERIC
> [...]

Yes, another comment in that file suggests that CHAP support was not
available at that time (2009) in lio and/or lio-t:

  lio|lio-t)
      # TODO: Remove incoming_username and incoming_password
      # from this check when LIO 3.0 gets CHAP authentication
      unsupported_params="tid incoming_username incoming_password"
      ;;

If you get it working with the current version of targetcli-fb, you can
create a pull request in the ClusterLabs repo :)

--
Valentin
Re: [ClusterLabs] False negative from kamailio resource agent
On Thu, Mar 22, 2018 at 03:36:55PM -0400, Alberto Mijares wrote:
> Straight to the question: how can I manually run a resource agent
> script (kamailio) simulating the pacemaker's environment without
> actually having pacemaker running?

You should be able to run it with something like:

  # OCF_ROOT=/usr/lib/ocf OCF_RESKEY_conffile=/etc/kamailio/kamailio.cfg \
      /usr/lib/ocf/resource.d/heartbeat/kamailio monitor
  INFO: No PID file found and our kamailio instance is not active

> We have this cluster in production and from time to time kamailio
> reported a failure when the reality is that kamailio was still
> running. The failover produces a small downtime unnecessarily, so we
> decided to stop it until we find a solution for that.
>
> I saw the check functions in the script. They all run OK out of the
> pacemaker's environment, so I need to replicate it by my own.

Do you have anything in the logs that might say why the monitor action
failed for the resource? Maybe it was overloaded for a moment and did
not respond to the sipsak tests.

--
Valentin
Re: [ClusterLabs] single node fails to start the ocfs2 resource
On Mon, Mar 12, 2018 at 04:31:46PM +0100, Klaus Wenninger wrote:
> Nope. Whenever the cluster is completely down...
> Otherwise nodes would come up - if not seeing each other -
> happily with both starting all services because they don't
> know what already had been running on the other node.
> Technically it wouldn't even be possible to remember that
> they've seen each other once, as Corosync doesn't have
> "non-volatile storage" apart from the config file.

Interesting, I have the following config in a test cluster:

  nodelist {
      node {
          ring0_addr: sid1
          nodeid: 1
      }
      node {
          ring0_addr: sid2
          nodeid: 2
      }
  }

  quorum {
      # Enable and configure quorum subsystem (default: off)
      # see also corosync.conf.5 and votequorum.5
      provider: corosync_votequorum
      expected_votes: 1
      two_node: 1
  }

And the behaviour when both nodes are down seems to be:

  1. One node up
  2. Fence other node
  3. Start services

  Mar 12 18:15:01 sid1 crmd[555]: notice: Connecting to cluster infrastructure: corosync
  Mar 12 18:15:01 sid1 crmd[555]: notice: Quorum acquired
  Mar 12 18:15:01 sid1 crmd[555]: notice: Node sid1 state is now member
  Mar 12 18:15:01 sid1 crmd[555]: notice: State transition S_STARTING -> S_PENDING
  Mar 12 18:15:23 sid1 crmd[555]: warning: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
  Mar 12 18:15:23 sid1 crmd[555]: notice: State transition S_ELECTION -> S_INTEGRATION
  Mar 12 18:15:23 sid1 crmd[555]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
  Mar 12 18:15:23 sid1 crmd[555]: notice: Result of probe operation for stonith-sbd on sid1: 7 (not running)
  Mar 12 18:15:23 sid1 crmd[555]: notice: Result of probe operation for dlm on sid1: 7 (not running)
  Mar 12 18:15:23 sid1 crmd[555]: notice: Result of probe operation for admin-ip on sid1: 7 (not running)
  Mar 12 18:15:23 sid1 crmd[555]: notice: Result of probe operation for clusterfs on sid1: 7 (not running)
  Mar 12 18:15:57 sid1 stonith-ng[551]: notice: Operation 'reboot' [1454] (call 2 from crmd.555) for host 'sid2' with device 'stonith-sbd' returned: 0 (OK)
  Mar 12 18:15:57 sid1 stonith-ng[551]: notice: Operation reboot of sid2 by sid1 for crmd.555@sid1.ece4f9c5: OK
  Mar 12 18:15:57 sid1 crmd[555]: notice: Node sid2 state is now lost
  Mar 12 18:15:58 sid1 crmd[555]: notice: Result of start operation for dlm on sid1: 0 (ok)
  Mar 12 18:15:58 sid1 crmd[555]: notice: Result of start operation for admin-ip on sid1: 0 (ok)
  Mar 12 18:15:58 sid1 crmd[555]: notice: Result of start operation for stonith-sbd on sid1: 0 (ok)
  Mar 12 18:15:58 sid1 crmd[555]: notice: Result of start operation for clusterfs on sid1: 0 (ok)
  Mar 12 18:15:58 sid1 crmd[555]: notice: Transition 0 (Complete=18, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-32.bz2): Complete
  Mar 12 18:15:58 sid1 crmd[555]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE

--
Valentin
Re: [ClusterLabs] single node fails to start the ocfs2 resource
On Mon, Mar 12, 2018 at 01:58:21PM +0100, Klaus Wenninger wrote:
> But isn't dlm directly interfering with corosync so that it would get
> the quorum state from there?
> As you have two_node set - probably on a 2-node cluster - this would,
> after both nodes go down, wait for all nodes up first.

Isn't wait_for_all only used during cluster installation? From
votequorum(5):

  "When WFA is enabled, the cluster will be quorate for the first time
  only after all nodes have been visible at least once at the same
  time."

--
Valentin
Re: [ClusterLabs] trouble with IPaddr2
On Wed, Oct 11, 2017 at 02:36:24PM +0200, Valentin Vidic wrote:
> AFAICT, it found a better interface with that subnet and tried
> to use it instead of the one specified in the parameters :)
>
> But maybe IPaddr2 should just skip interface auto-detection
> if an explicit interface was given in the parameters?

Oh, it seems you specified nic only for the monitor operation, so it
would fall back to auto-detection for the start and stop actions:

  primitive HA_IP-Serv1 IPaddr2 \
      params ip=172.16.101.70 cidr_netmask=16 \
      op monitor interval=20 timeout=30 on-fail=restart nic=bond0 \
      meta target-role=Started

So you probably wanted this instead:

  primitive HA_IP-Serv1 IPaddr2 \
      params ip=172.16.101.70 cidr_netmask=16 nic=bond0 \
      op monitor interval=20 timeout=30 on-fail=restart \
      meta target-role=Started

--
Valentin
Re: [ClusterLabs] trouble with IPaddr2
On Wed, Oct 11, 2017 at 01:29:40PM +0200, Stefan Krueger wrote:
> ohh damn.. thanks a lot for this hint.. I deleted all the IPs on
> enp4s0f0, and now it works..
> but could you please explain why it works now? why did it have a
> problem with those IPs?

AFAICT, it found a better interface with that subnet and tried to use
it instead of the one specified in the parameters :)

But maybe IPaddr2 should just skip interface auto-detection if an
explicit interface was given in the parameters?

--
Valentin
Re: [ClusterLabs] trouble with IPaddr2
On Wed, Oct 11, 2017 at 10:51:04AM +0200, Stefan Krueger wrote:
> primitive HA_IP-Serv1 IPaddr2 \
>     params ip=172.16.101.70 cidr_netmask=16 \
>     op monitor interval=20 timeout=30 on-fail=restart nic=bond0 \
>     meta target-role=Started

There might be something wrong with the network setup, because enp4s0f0
gets used instead of bond0:

> Oct 11 08:19:32 zfs-serv2 IPaddr2(HA_IP-Serv1)[27672]: INFO: Adding
> inet address 172.16.101.70/16 with broadcast address 172.16.255.255
> to device enp4s0f0

Can you share more info on the network of zfs-serv2, for example the
output of 'ip a'?

--
Valentin
Re: [ClusterLabs] corosync service not automatically started
On Tue, Oct 10, 2017 at 11:26:24AM +0200, Václav Mach wrote:
> # The primary network interface
> allow-hotplug eth0
> iface eth0 inet dhcp
> # This is an autoconfigured IPv6 interface
> iface eth0 inet6 auto

allow-hotplug or dhcp could be causing problems. You can try disabling
corosync and pacemaker so they don't start on boot, and start them
manually after a few minutes when the network is stable. If that works,
then you have some kind of timing issue. You can try using 'auto eth0'
or a static IP address to see if it helps...

--
Valentin
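[A static variant of that stanza would look roughly like this sketch;
the addresses are placeholders:

  # /etc/network/interfaces
  auto eth0
  iface eth0 inet static
      address 192.0.2.10
      netmask 255.255.255.0
      gateway 192.0.2.1

With "auto" the interface is brought up during boot before the
networking target completes, so corosync no longer races against DHCP.]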
Re: [ClusterLabs] corosync service not automatically started
On Tue, Oct 10, 2017 at 10:35:17AM +0200, Václav Mach wrote:
> Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB] Denied connection, is not ready (709-1337-18)
> Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB] Denied connection, is not ready (709-1337-18)
> Oct 10 10:27:07 r1nren.et.cesnet.cz corosync[709]: [QB] Denied connection, is not ready (709-1337-18)
> Oct 10 10:27:08 r1nren.et.cesnet.cz corosync[709]: [QB] Denied connection, is not ready (709-1337-18)
> Oct 10 10:27:09 r1nren.et.cesnet.cz corosync[709]: [QB] Denied connection, is not ready (709-1337-18)

Could it be that the network or the firewall takes some time to start
on boot?

--
Valentin
Re: [ClusterLabs] PostgreSQL Automatic Failover (PAF) v2.2.0
On Thu, Oct 05, 2017 at 08:55:59PM +0200, Jehan-Guillaume de Rorthais wrote:
> It doesn't seem impossible, however I'm not sure of the complexity
> around this.
>
> You would have to either hack PAF and detect failover/migration or
> create a new RA that would always be part of the transition, implying
> your PAF RA to define if it is moving elsewhere or not.
>
> It feels the complexity is quite high and would require some expert
> advice about Pacemaker internals to avoid wrong or unrelated behaviors
> or race conditions.
>
> But, before going farther, you need to realize a failover will never
> be transparent. Especially one that would trigger randomly outside of
> your control.

Yes, I was thinking more about manual failover, for example to upgrade
the postgresql master. An RA for pgbouncer would wait for all active
queries to finish and queue all new queries. Once there is nothing
running on the master anymore, another slave is activated and pgbouncer
would then resume queries there.

--
Valentin
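[On the pgbouncer side, the building blocks for this already exist in
the admin console; a sketch where the database name is a placeholder:

  -- connect with: psql -p 6432 -U pgbouncer pgbouncer
  PAUSE mydb;    -- wait for active transactions to finish, queue new clients
  -- ... promote the new master and repoint pgbouncer at it ...
  RELOAD;        -- re-read the updated pgbouncer.ini
  RESUME mydb;   -- queued clients continue against the new master

The open question is wiring these commands into the pacemaker
transition at the right moments.]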
Re: [ClusterLabs] PostgreSQL Automatic Failover (PAF) v2.2.0
On Tue, Sep 12, 2017 at 04:48:19PM +0200, Jehan-Guillaume de Rorthais wrote:
> PostgreSQL Automatic Failover (PAF) v2.2.0 has been released on
> September 12th 2017 under the PostgreSQL licence.
>
> See: https://github.com/dalibo/PAF/releases/tag/v2.2.0
>
> PAF is a PostgreSQL resource agent for Pacemaker. Its original aim is
> to keep it clear between the Pacemaker administration and the
> PostgreSQL one, to keep things simple, documented and yet powerful.

Do you think it might be possible to integrate the PostgreSQL
replication with pgbouncer for a transparent failover? The idea would
be to pause the clients in pgbouncer while moving the replication
master so no queries would fail.

--
Valentin
Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck
On Mon, Sep 11, 2017 at 04:18:08PM +0200, Klaus Wenninger wrote: > Just for my understanding: You are using watchdog-handling in corosync? The corosync package in Debian gets built with --enable-watchdog, so by default it takes /dev/watchdog at runtime. I don't think the SUSE or RedHat packages are built with --enable-watchdog, so this behavior is disabled there. -- Valentin ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
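If that corosync watchdog handling gets in the way (for example when sbd should own the device instead), it can reportedly be turned off in corosync.conf; the 'off' value is what the follow-up message below describes as working:

    resources {
        watchdog_device: off
    }

Alternatively, watchdog_device can point at a specific device node such as /dev/watchdog1 so that corosync and sbd do not fight over the same one.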
Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck
On Sun, Sep 10, 2017 at 08:27:47AM +0200, Ferenc Wágner wrote: > Confirmed: setting watchdog_device: off cluster wide got rid of the > above warnings. Interesting, what brand or version of IPMI has this problem? -- Valentin ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] XenServer guest and host watchdog
On Fri, Sep 08, 2017 at 09:39:26PM +0100, Andrew Cooper wrote: > Yes. The internal mechanism of the host watchdog is to use one > performance counter to count retired instructions and generate an NMI > roughly once every half second (give or take C and P states). > > Separately, there is a one second timer (the same framework as all other > timers in Xen, including the guest watchdog), which triggers a softirq > (lower priority, runs on the return-to-guest path), which increments a > local variable. If the NMI handler doesn't observe this local variable > incrementing in the timeout period, Xen crash the entire system. Thanks for the explanation. And in addition to the software guest and host watchdogs, an external watchdog device like ipmi_watchdog or iTCO_wdt can be used inside dom0. -- Valentin ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
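For instance, loading a hardware watchdog driver in dom0 and checking that it registered could look like this (the driver and device name vary by platform, so this is only a sketch):

    modprobe iTCO_wdt
    wdctl /dev/watchdog0    # shows the timeout, status and which driver provides the device

Whatever consumes the device afterwards (sbd, corosync or watchdogd) has to keep writing to it, otherwise the hardware resets the host once the timeout expires.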
Re: [ClusterLabs] XenServer guest and host watchdog
On Fri, Sep 08, 2017 at 12:57:12PM +0000, Mark Syms wrote: > As we discussed regarding the handling of watchdog in XenServer, both > guest and host, I've had a discussion with our subject matter expert > (Andrew, cc'd) on this topic. The guest watchdogs are handled by a > hardware timer in the hypervisor but if the timers themselves are not > serviced within 5 seconds the host watchdog will fire and pull the > host down. I presume the host watchdog is the NMI watchdog described in the Xen Hypervisor Command Line Options? watchdog = force | <boolean> (Default: false) Run an NMI watchdog on each processor. If a processor is stuck for longer than the watchdog_timeout, a panic occurs. When force is specified, in addition to running an NMI watchdog on each processor, unknown NMIs will still be processed. watchdog_timeout = <integer> (Default: 5) Set the NMI watchdog timeout in seconds. Specifying 0 will turn off the watchdog. -- Valentin ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
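On a Debian-style Xen host these options would typically be set on the hypervisor command line via grub, e.g. (the file and variable name are assumptions that depend on the distribution's grub integration):

    # /etc/default/grub.d/xen.cfg
    GRUB_CMDLINE_XEN_DEFAULT="$GRUB_CMDLINE_XEN_DEFAULT watchdog=true watchdog_timeout=10"

followed by update-grub and a reboot of the host.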
Re: [ClusterLabs] epic fail
On Mon, Jul 24, 2017 at 10:38:40AM -0500, Ken Gaillot wrote: > Standby is not necessary, it's just a cautious step that allows the > admin to verify that all resources moved off correctly. The restart that > yum does should be sufficient for pacemaker to move everything. > > A restart shouldn't lead to fencing in any case where something's not > going seriously wrong. I'm not familiar with the "kernel is using it" > message, I haven't run into that before. Right, the pacemaker upgrade itself might not be the biggest problem. I've seen other package upgrades cause RA monitors to return results like $OCF_NOT_RUNNING or $OCF_ERR_INSTALLED. This of course causes the cluster to react, so I prefer the node standby option :) In this case pacemaker was trying to stop the resources, the stop action failed, and the upgrading node was killed off by the second node trying to clean up the mess. The resources should have come up on the second node after that. -- Valentin ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
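The standby procedure itself is short; a sketch with crmsh (the node name is made up):

    crm node standby node1    # resources are moved off; their monitors no longer run here
    yum update                # pacemaker may be restarted during the upgrade
    crm node online node1     # rejoin the cluster; resources move back if so configured

While the node is in standby, a monitor returning $OCF_NOT_RUNNING or $OCF_ERR_INSTALLED mid-upgrade has nothing to act on, so no recovery or fencing is triggered.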
Re: [ClusterLabs] epic fail
On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote: > Lsof/fuser show the PID of the process holding FS open as "kernel". That could be the NFS server running in the kernel. -- Valentin ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
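One quick way to check whether the in-kernel NFS server is what's holding the filesystem (a sketch; both interfaces are standard kernel ones, but they only exist while nfsd is loaded):

    pgrep -a nfsd               # kernel nfsd threads show up as [nfsd]
    cat /proc/fs/nfsd/threads   # number of running nfsd threads

If nfsd is exporting the DRBD-backed filesystem, the export has to be dropped or nfsd stopped before the unmount can succeed, which is why the NFS server is usually managed as a cluster resource ordered after the Filesystem resource.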
Re: [ClusterLabs] epic fail
On Sun, Jul 23, 2017 at 07:27:03AM -0500, Dmitri Maziuk wrote: > So yesterday I ran yum update that puled in the new pacemaker and tried to > restart it. The node went into its usual "can't unmount drbd because kernel > is using it" and got stonith'ed in the middle of yum transaction. The end > result: DRBD reports split brain, HA daemons don't start on boot, RPM > database is FUBAR. I've had enough. I'm rebuilding this cluster as centos 6 > + heartbeat R1. It seems you did not put the node into standby before the upgrade as it still had resources running. What was the old/new pacemaker version there? -- Valentin ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Coming in Pacemaker 1.1.17: container bundles
On Fri, Jun 30, 2017 at 12:46:29PM -0500, Ken Gaillot wrote: > The challenge is that some properties are docker-specific and other > container engines will have their own specific properties. > > We decided to go with a tag for each supported engine -- so if we add > support for rkt, we'll add a <rkt> tag with whatever properties it > needs. Then a <bundle> would need to contain either a <docker> tag or a > <rkt> tag. > > We did consider a generic alternative like: > > <container type="docker"> > ... > </container> > > But it was decided that using engine-specific tags would allow for > schema enforcement, and would be more readable. > > The <network> and <storage> tags were kept under <bundle> because we > figured those are essential to the concept of a bundle, and any engine > should support some way of mapping those. Thanks for the explanation, it makes sense :) Now I have a working rkt resource agent and would like to test it. Can you share the pcmk:httpd image mentioned in the docker example? -- Valentin ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Coming in Pacemaker 1.1.17: container bundles
On Fri, Mar 31, 2017 at 05:43:02PM -0500, Ken Gaillot wrote: > Here's an example of the CIB XML syntax (higher-level tools will likely > provide a more convenient interface): > > <bundle id="httpd-bundle"> > <docker image="pcmk:httpd" ... /> > ... > </bundle> Would it be possible to make this a bit more generic, with a single tag carrying the engine type, so we have support for other container engines like rkt? -- Valentin ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] simple active/active router using pacemaker+corosync
On Thu, Jan 26, 2017 at 12:10:24PM +0100, Arturo Borrero Gonzalez wrote: > I have a rather simple 2 nodes active/active router using pacemaker+corosync. > > Why active-active? Well, one node holds the virtual IPv4 resources and > the other node holds the virtual IPv6 resources. > On failover, both nodes are able to run all the virtual IPv4/IPv6 addresses. > > We have about 30 resources configured, and more will be added in the future. You may need to check some pacemaker limits for this number of resources: * batch-limit (30) The number of jobs that the Transition Engine (TE) is allowed to execute in parallel. The TE is the logic in pacemaker’s CRMd that executes the actions determined by the Policy Engine (PE). The "correct" value will depend on the speed and load of your network and cluster nodes. * migration-limit (-1) The number of migration jobs that the TE is allowed to execute in parallel on a node. A value of -1 means unlimited. > The problems/questions are: > > * The IPv6addr resource agent is so slow. I guess that's because of > the additional checks (pings). I had to switch to IPaddr2 for the > virtual IPv6 resources as well, which improves the failover times a > bit. Is this expected? Any hint here? Can you check how slow it is? It should take 5 seconds to send advertisements, so the whole move takes 6-7 seconds, which seems reasonable to me. The address should be functional most of that time. > * In order to ease management, I created 2 groups, one for all the > IPv4 addresses and other for all the IPv6 addresses. This way, I can > perform operations (such as movements, start/stop) for all the > resources in one go. This has a known drawback: in a group, the > resources are managed in chain by the order of the group. On failover, > this really hurts the movement time, since resources aren't moved in > parallel but sequentially. Any hint here? > > I would like to have a simple way of managing lot of resources in one > go, but without the ordering drawbacks of a group. Guess you could create a Dummy resource and make INFINITY colocation constraints for the IPs so they follow the Dummy as it moves between the nodes :) -- Valentin ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
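A sketch of that Dummy-anchor setup in crmsh syntax (all resource names here are invented):

    primitive ip-anchor ocf:pacemaker:Dummy
    colocation vip4-1-with-anchor inf: vip4-1 ip-anchor
    colocation vip6-1-with-anchor inf: vip6-1 ip-anchor
    # ...one colocation constraint per address resource

Moving ip-anchor ('crm resource move ip-anchor node2') then drags all the addresses along, and since the addresses are not members of an ordered group they can start in parallel.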
Re: [ClusterLabs] eventmachine gem in pcsd
On Thu, Jun 30, 2016 at 01:27:25PM +0200, Tomas Jelinek wrote: > It seems eventmachine can be safely dropped as all tests passed without it. Great, thanks for confirming. -- Valentin ___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] pcs testsuite status
On Wed, Jun 29, 2016 at 10:31:42AM +0200, Tomas Jelinek wrote: > This should be replaceable by any agent which does not provide unfencing, > i.e. it does not have on_target="1" automatic="1" attributes in > <action name="on" />. You may need to experiment with few agents to find one which > works. Just changed fence_xvm to fence_dummy and the tests pass with that. -- Valentin ___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] pcs testsuite status
I'm trying to run pcs tests on Debian unstable, but there are some strange failures like diffs failing due to an additional space at the end of the line or just with "Error: cannot load cluster status, xml does not conform to the schema". Any idea what could be the issue here? I assume the tests work on RHEL7, so the problem might be with the package versions I'm using:

pacemaker: 1.1.15~rc3-2
corosync: 2.3.6-1
pcs: 0.9.152-1

======================================================================
FAIL: testNodeStandby (pcs.test.test_cluster.ClusterTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_cluster.py", line 45, in testNodeStandby
    ac(output, "")
  File "/usr/lib/python2.7/dist-packages/pcs/test/tools/misc.py", line 31, in ac
    "strings not equal:\n{0}".format(prepare_diff(b, a))
AssertionError: strings not equal:
+ Error: cannot load cluster status, xml does not conform to the schema

======================================================================
FAIL: testFenceLevels (pcs.test.test_stonith.StonithTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_stonith.py", line 374, in testFenceLevels
    assert returnVal == 0
AssertionError

======================================================================
FAIL: testStonithCreation (pcs.test.test_stonith.StonithTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_stonith.py", line 161, in testStonithCreation
    """)
  File "/usr/lib/python2.7/dist-packages/pcs/test/tools/misc.py", line 31, in ac
    "strings not equal:\n{0}".format(prepare_diff(b, a))
AssertionError: strings not equal:
+ Error: cannot load cluster status, xml does not conform to the schema
  Cluster Name: test99
- Corosync Nodes:
-  rh7-1 rh7-2
- Pacemaker Nodes:
-
- Resources:
-
- Stonith Devices:
-  Resource: test1 (class=stonith type=fence_noxist)
-   Operations: monitor interval=60s (test1-monitor-interval-60s)
-  Resource: test2 (class=stonith type=fence_ilo)
-   Operations: monitor interval=60s (test2-monitor-interval-60s)
-  Resource: test3 (class=stonith type=fence_ilo)
-   Attributes: ipaddr=test login=testA
-   Operations: monitor interval=60s (test3-monitor-interval-60s)
-  Resource: test-fencing (class=stonith type=fence_apc)
-   Attributes: pcmk_host_list="rhel7-node1
-   Operations: monitor interval=61s (test-fencing-monitor-interval-61s)
- Fencing Levels:
-
- Location Constraints:
- Ordering Constraints:
- Colocation Constraints:
- Ticket Constraints:
-
- Resources Defaults:
-  No defaults set
- Operations Defaults:
-  No defaults set
-
- Cluster Properties:
-
- Quorum:
-  Options:

======================================================================
FAIL: testStonithDeleteRemovesLevel (pcs.test.test_stonith.StonithTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_stonith.py", line 665, in testStonithDeleteRemovesLevel
    self.assertEqual(returnVal, 0)
AssertionError: 1 != 0

======================================================================
FAIL: test_stonith_create_provides_unfencing (pcs.test.test_stonith.StonithTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_stonith.py", line 193, in test_stonith_create_provides_unfencing
    ac(output, "")
  File "/usr/lib/python2.7/dist-packages/pcs/test/tools/misc.py", line 31, in ac
    "strings not equal:\n{0}".format(prepare_diff(b, a))
AssertionError: strings not equal:
+ Error: Agent 'fence_xvm' not found, use --force to override

======================================================================
FAIL: test_node_maintenance (pcs.test.test_node.NodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_node.py", line 31, in test_node_maintenance
    ac("", output)
  File "/usr/lib/python2.7/dist-packages/pcs/test/tools/misc.py", line 31, in ac
    "strings not equal:\n{0}".format(prepare_diff(b, a))
AssertionError: strings not equal:
- Error: cannot load cluster status, xml does not conform to the schema

======================================================================
FAIL: test_node_standby (pcs.test.test_node.NodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pcs/test/test_node.py", line 145, in test_node_standby
    ac(output, "")
  File "/usr/lib/python2.7/dist-packages/pcs/test/tools/misc.py", line 31, in ac
    "strings not
Re: [ClusterLabs] dlm_controld 4.0.4 exits when crmd is fencing another node
On Fri, Jan 22, 2016 at 07:57:52PM +0300, Vladislav Bogdanov wrote: > Tried reverting this one and a51b2bb ("If an error occurs unlink the > lock file and exit with status 1") one-by-one and both together, the > same result. > > So problem seems to be somewhere deeper. I've got the same fencing problem with dlm-4.0.4 on Debian. Looking at an strace of the dlm_controld process, I can see it exits right after returning from the poll call due to a SIGCHLD signal:

wait4(2279, 0x7ffd2f468afc, WNOHANG, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 10, 1000) = 0 (Timeout)
wait4(2279, 0x7ffd2f468afc, WNOHANG, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 10, 1000) = 0 (Timeout)
wait4(2279, 0x7ffd2f468afc, WNOHANG, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 10, 1000) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2279, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn() = -1 EINTR (Interrupted system call)
close(11) = 0
sendto(10, "\240", 1, MSG_NOSIGNAL, NULL, 0) = 1
sendto(17, "\20", 1, MSG_NOSIGNAL, NULL, 0) = 1
poll([{fd=17, events=POLLIN}], 1, 0) = 0 (Timeout)
shutdown(17, SHUT_RDWR) = 0
close(17) = 0
munmap(0x7f5f45c26000, 2105344) = 0
munmap(0x7f5f4aeea000, 8248) = 0
munmap(0x7f5f45a24000, 2105344) = 0
munmap(0x7f5f4aee7000, 8248) = 0
munmap(0x7f5f45822000, 2105344) = 0

and in fact there is a recent change in 4.0.4 modifying that part of the code:

If an error occurs unlink the lock file and exit with status 1
https://git.fedorahosted.org/cgit/dlm.git/commit/?id=a51b2bbe413222829778698e62af88a73ebec233

The bug is caused by the missing braces in the expanded if statement: the unconditional goto now exits the main loop on any interrupted poll, such as the SIGCHLD above. Do you think we can get a new version out with this patch, as fencing in 4.0.4 does not work properly due to this issue?

-- Valentin

Index: dlm-4.0.4/dlm_controld/main.c
===================================================================
--- dlm-4.0.4.orig/dlm_controld/main.c
+++ dlm-4.0.4/dlm_controld/main.c
@@ -1028,9 +1028,10 @@ static int loop(void)
 	for (;;) {
 		rv = poll(pollfd, client_maxi + 1, poll_timeout);
 		if (rv == -1 && errno == EINTR) {
-			if (daemon_quit && list_empty(&lockspaces))
+			if (daemon_quit && list_empty(&lockspaces)) {
 				rv = 0;
 				goto out;
+			}
 			if (daemon_quit) {
 				log_error("shutdown ignored, active lockspaces");
 				daemon_quit = 0;

___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org