Re: [Pacemaker] Fencing of movable VirtualDomains
Andrew Beekhof <and...@beekhof.net> writes:

>> Maybe not, the colocation should be sufficient, but even without the
>> orders, fencing of unclean VMs is attempted with other stonith devices.
>
> Which other devices? The config you sent through didn't have any others.

Sorry, I sent it to the linux-cluster mailing list but not here; I attach
it below.

>> I'll switch to newer corosync/pacemaker and use pacemaker_remote if I
>> can manage dlm/cLVM/OCFS2 with it.
>
> No can do. All three services require corosync on the node.

OK, so the remote is useless in my case, but upgrading seems required[1]
anyway, since the wheezy software stack looks too old.

Thanks.

Footnotes:
[1] http://article.gmane.org/gmane.linux.redhat.cluster/22963

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF

node nebula1
node nebula2
node nebula3
node one
node quorum \
        attributes standby=on
primitive ONE-Frontend ocf:heartbeat:VirtualDomain \
        params config=/var/lib/one/datastores/one/one.xml \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=100 \
        meta target-role=Stopped
primitive ONE-OCFS2-datastores ocf:heartbeat:Filesystem \
        params device=/dev/one-fs/datastores directory=/var/lib/one/datastores fstype=ocfs2 \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=100 \
        op monitor interval=20 timeout=40
primitive ONE-vg ocf:heartbeat:LVM \
        params volgrpname=one-fs \
        op start interval=0 timeout=30 \
        op stop interval=0 timeout=30 \
        op monitor interval=60 timeout=30
primitive Quorum-Node ocf:heartbeat:VirtualDomain \
        params config=/var/lib/libvirt/qemu/pcmk/quorum.xml \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=100 \
        meta target-role=Started
primitive Stonith-ONE-Frontend stonith:external/libvirt \
        params hostlist=one hypervisor_uri=qemu:///system pcmk_host_list=one pcmk_host_check=static-list \
        op monitor interval=30m \
        meta target-role=Started
primitive Stonith-Quorum-Node stonith:external/libvirt \
        params hostlist=quorum hypervisor_uri=qemu:///system pcmk_host_list=quorum pcmk_host_check=static-list \
        op monitor interval=30m \
        meta target-role=Started
primitive Stonith-nebula1-IPMILAN stonith:external/ipmi \
        params hostname=nebula1-ipmi ipaddr=A.B.C.D interface=lanplus userid=user passwd=X passwd_method=env priv=operator pcmk_host_list=nebula1 pcmk_host_check=static-list priority=10 \
        op monitor interval=30m \
        meta target-role=Started
primitive Stonith-nebula2-IPMILAN stonith:external/ipmi \
        params hostname=nebula2-ipmi ipaddr=A.B.C.D interface=lanplus userid=user passwd=X passwd_method=env priv=operator pcmk_host_list=nebula2 pcmk_host_check=static-list priority=20 \
        op monitor interval=30m \
        meta target-role=Started
primitive Stonith-nebula3-IPMILAN stonith:external/ipmi \
        params hostname=nebula3-ipmi ipaddr=A.B.C.D interface=lanplus userid=user passwd=X passwd_method=env priv=operator pcmk_host_list=nebula3 pcmk_host_check=static-list priority=30 \
        op monitor interval=30m \
        meta target-role=Started
primitive clvm ocf:lvm2:clvm \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=90 \
        op monitor interval=60 timeout=90
primitive dlm ocf:pacemaker:controld \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=100 \
        op monitor interval=60 timeout=60
primitive o2cb ocf:pacemaker:o2cb \
        params stack=pcmk daemon_timeout=30 \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=100 \
        op monitor interval=60 timeout=60
group ONE-Storage dlm o2cb clvm ONE-vg ONE-OCFS2-datastores
clone ONE-Storage-Clone ONE-Storage \
        meta interleave=true target-role=Started
location Nebula1-does-not-fence-itself Stonith-nebula1-IPMILAN \
        rule $id=Nebula1-does-not-fence-itself-rule inf: #uname ne nebula1
location Nebula2-does-not-fence-itself Stonith-nebula2-IPMILAN \
        rule $id=Nebula2-does-not-fence-itself-rule inf: #uname ne nebula2
location Nebula3-does-not-fence-itself Stonith-nebula3-IPMILAN \
        rule $id=Nebula3-does-not-fence-itself-rule inf: #uname ne nebula3
location Nodes-with-ONE-Storage ONE-Storage-Clone \
        rule $id=Nodes-with-ONE-Storage-rule inf: #uname eq nebula1 or #uname eq nebula2 or #uname eq nebula3 or #uname eq one
location ONE-Frontend-fenced-by-hypervisor Stonith-ONE-Frontend \
        rule $id=ONE-Frontend-fenced-by-hypervisor-rule inf: #uname ne quorum or #uname ne one
location ONE-Frontend-run-on-hypervisor ONE-Frontend \
        rule $id=ONE-Frontend-run-on-hypervisor-rule 40: #uname eq nebula1 \
        rule $id=ONE-Frontend-run-on-hypervisor-rule-0 30: #uname eq nebula2 \
        rule
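For readers wondering how to tie each node to its dedicated stonith device
more explicitly than with location rules: recent crm shells support a
fencing topology. A minimal sketch reusing the device names from the
configuration above -- an illustration only, not part of Daniel's actual
setup:

    fencing_topology \
            nebula1: Stonith-nebula1-IPMILAN \
            nebula2: Stonith-nebula2-IPMILAN \
            nebula3: Stonith-nebula3-IPMILAN \
            one: Stonith-ONE-Frontend \
            quorum: Stonith-Quorum-Node

With a topology like this in place, stonithd consults only the device
listed for a given target instead of trying every registered device.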
[Pacemaker] unknown third node added to a 2 node cluster?
Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5 and
node6, I saw an unknown third node being added to the cluster, but only
on node5:

Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 12: memb=2, new=0, lost=0
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: memb: node6 3713011210
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: memb: node5 3729788426
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 12: memb=3, new=1, lost=0
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: update_member: Creating entry for node 2085752330 born on 12
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: update_member: Node 2085752330/unknown is now: member
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: NEW: .pending. 2085752330
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: MEMB: node6 3713011210
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: MEMB: node5 3729788426
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: pcmk_peer_update: MEMB: .pending. 2085752330

Above is where this third node seems to appear.

Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] info: send_member_notification: Sending membership update 12 to 2 children
Sep 18 22:52:16 node5 corosync[17321]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 18 22:52:16 node5 cib[17371]: notice: crm_update_peer_state: plugin_handle_membership: Node (null)[2085752330] - state is now member (was (null))
Sep 18 22:52:16 node5 crmd[17376]: notice: crm_update_peer_state: plugin_handle_membership: Node (null)[2085752330] - state is now member (was (null))
Sep 18 22:52:16 node5 crmd[17376]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 18 22:52:16 node5 crmd[17376]: error: join_make_offer: No recipient for welcome message
Sep 18 22:52:16 node5 crmd[17376]: warning: do_state_transition: Only 2 of 3 cluster nodes are eligible to run resources - continue 0
Sep 18 22:52:16 node5 attrd[17374]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Sep 18 22:52:16 node5 attrd[17374]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Sep 18 22:52:16 node5 stonith-ng[17372]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: Diff: --- 0.31.2
Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: Diff: +++ 0.32.1 4a679012144955c802557a39707247a2
Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: -- <nvpair value="Stopped" id="res1-meta_attributes-target-role"/>
Sep 18 22:52:16 node5 cib[17371]: notice: cib:diff: ++ <nvpair name="target-role" id="res1-meta_attributes-target-role" value="Started"/>
Sep 18 22:52:16 node5 pengine[17375]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 18 22:52:16 node5 pengine[17375]: notice: LogActions: Start res1#011(node5)
Sep 18 22:52:16 node5 crmd[17376]: notice: te_rsc_command: Initiating action 7: start res1_start_0 on node5 (local)
Sep 18 22:52:16 node5 pengine[17375]: notice: process_pe_message: Calculated Transition 22: /var/lib/pacemaker/pengine/pe-input-165.bz2
Sep 18 22:52:16 node5 stonith-ng[17372]: notice: stonith_device_register: Device 'st-fencing' already existed in device list (1 active devices)

On node6 at the same time the following was in the log:
Sep 18 22:52:15 node6 corosync[11178]: [TOTEM ] Incrementing problem counter for seqid 5 iface 10.128.0.221 to [1 of 10]
Sep 18 22:52:16 node6 corosync[11178]: [TOTEM ] Incrementing problem counter for seqid 8 iface 10.128.0.221 to [2 of 10]
Sep 18 22:52:17 node6 corosync[11178]: [TOTEM ] Decrementing problem counter for iface 10.128.0.221 to [1 of 10]
Sep 18 22:52:19 node6 corosync[11178]: [TOTEM ] ring 1 active with no faults

Any idea what's going on here?

Cheers,
b.
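If corosync was left to auto-generate node ids (the default with these
plugin-era packages), the id is simply the ring0 IPv4 address packed into
a 32-bit integer, so the mystery id can be decoded. A quick sketch,
assuming that default and a little-endian host:

    # Decode corosync auto-generated node ids back to ring0 IPv4 addresses.
    for id in 3729788426 3713011210 2085752330; do
        python -c 'import socket, struct, sys; print(socket.inet_ntoa(struct.pack("<I", int(sys.argv[1]))))' "$id"
    done

If the assumption holds, the first two ids decode to node5's and node6's
ring0 addresses and the third to whatever address the phantom member bound
to; comparing that against the local interfaces usually identifies the
culprit.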
Re: [Pacemaker] running arbitrary script when resource fails
Thanks.

-----Original Message-----
From: Ken Gaillot [mailto:kjgai...@gleim.com]
Sent: Tuesday, 7 October 2014 7:24 AM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] running arbitrary script when resource fails

On 10/06/2014 06:20 AM, Alex Samad - Yieldbroker wrote:
> Is it possible to do this? Or even on any major fail -- I would like to
> send a signal to my zabbix server.
>
> Alex

Hi Alex,

This sort of thing has been discussed before; for example, see
http://oss.clusterlabs.org/pipermail/pacemaker/2014-August/022418.html

At Gleim, we use an active monitoring approach -- instead of waiting for a
notification, our monitor polls the cluster regularly. In our case, we're
using the check_crm nagios plugin available at
https://github.com/dnsmichi/icinga-plugins/blob/master/scripts/check_crm.
It's a fairly simple Perl script utilizing crm_mon, so you could probably
tweak the output to fit something zabbix expects, if there isn't an
equivalent for zabbix already. And of course you can configure zabbix to
monitor the services running on the cluster as well.

--
Ken Gaillot <kjgai...@gleim.com>
Network Operations Center, Gleim Publications
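For a zabbix-native variant of the same polling idea, a small cron job
around crm_mon and zabbix_sender may already be enough. A rough sketch
only -- the server name and item key are placeholders, and the grep
pattern assumes crm_mon flags failed resources with the string FAILED:

    #!/bin/sh
    # Count failed resources in a one-shot crm_mon run and push the
    # number to Zabbix as a trapper item.
    FAILED=$(crm_mon -1 | grep -c FAILED)
    zabbix_sender -z zabbix.example.com -s "$(hostname)" \
            -k pacemaker.failed_resources -o "$FAILED"

A trigger on pacemaker.failed_resources > 0 then plays the role the
nagios alert plays with check_crm.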
Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
Hello Andrew,

On 06.10.2014 04:30, Andrew Beekhof wrote:
> On 3 Oct 2014, at 5:07 am, Felix Zachlod <fz.li...@sis-gmbh.info> wrote:
>> On 02.10.2014 18:02, Digimer wrote:
>>> On 02/10/14 02:44 AM, Felix Zachlod wrote:
>>>> I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker
>>>> 1.1.7
>>>
>>> Please upgrade to 1.1.10+!
>>
>> Are you referring to a specific bug or code change? I normally don't
>> like building all this stuff from source instead of using the packages
>> if there aren't very good reasons for it. I have run some 1.1.7
>> Debian-based pacemaker clusters for a long time now without any issue,
>> and this version seems to run very stable, so as long as I am not
>> facing a specific problem with this version...
>
> According to git, there are 1143 specific problems with 1.1.7. In total
> there have been 3815 commits and 5 releases in the last 2.5 years; we
> don't do all that for fun :-)

I know that there have been a lot of changes since this ancient version.
But I was just curious whether there was something specific that might be
related to my problem. I work closely with software development in our
company, so I know that newer does not automatically mean fewer bugs, or
especially fewer bugs concerning ME. That's why I suspect installing the
recent version would be trial and error -- which might help in some cases
but does not explain the underlying problem in any way.

> On the other hand, if both sides think they have up-to-date data it
> might not be anything to do with pacemaker at all.

That is what I suspect too, and why I passed this question to the drbd
mailing list. I am now nearly totally convinced that pacemaker isn't
doing anything wrong here, because the drbd RA sets a master score of
1000 on either side, which according to my constraints was the signal
for pacemaker to promote.

regards, Felix
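One way to check that reading without guessing at the agent's behaviour
is to ask pacemaker for the scores it actually sees. A sketch -- the node
name nodeA and the resource name drbd_r0 (hence the attribute
master-drbd_r0) are invented for illustration:

    # Dump allocation and promotion scores from the live CIB.
    crm_simulate -sL

    # The drbd RA publishes its preference via crm_master as a
    # master-<resource> node attribute; query it directly.
    crm_attribute -N nodeA -n master-drbd_r0 -l reboot -G

If both nodes really carry a master score of 1000, promoting both is
exactly what a dual-primary (master-max=2) configuration asks pacemaker
to do.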
Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
On 8 Oct 2014, at 9:20 am, Felix Zachlod <fz.li...@sis-gmbh.info> wrote:

> [...]
>
> I know that there have been a lot of changes since this ancient version.
> But I was just curious whether there was something specific that might
> be related to my problem. I work closely with software development in
> our company, so I know that newer does not automatically mean fewer
> bugs, or especially fewer bugs concerning ME.

Particularly where the policy engine is concerned, it is actually true,
thanks to the 500+ regression tests we have. Also, there have definitely
been improvements to master/slave in the last few releases. Check out the
release notes; that's where I try to highlight the more
interesting/important fixes.

> That's why I suspect installing the recent version would be trial and
> error -- which might help in some cases but does not explain the
> underlying problem in any way.
>
>> On the other hand, if both sides think they have up-to-date data it
>> might not be anything to do with pacemaker at all.
>
> That is what I suspect too, and why I passed this question to the drbd
> mailing list. I am now nearly totally convinced that pacemaker isn't
> doing anything wrong here, because the drbd RA sets a master score of
> 1000 on either side, which according to my constraints was the signal
> for pacemaker to promote.
>
> regards, Felix
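For reference, the canonical dual-primary DRBD pairing in crm shell
syntax looks roughly like the sketch below; the names p_drbd, ms_drbd and
r0 are invented here, not taken from Felix's cluster:

    primitive p_drbd ocf:linbit:drbd \
            params drbd_resource=r0 \
            op monitor interval=29s role=Master \
            op monitor interval=31s role=Slave
    ms ms_drbd p_drbd \
            meta master-max=2 clone-max=2 notify=true interleave=true

master-max=2 is what permits two concurrent Primaries and notify=true is
required by the drbd agent; with a configuration equivalent to this, two
nodes reporting a master score of 1000 will both be promoted, so split
brain has to be prevented at the DRBD level (fencing
resource-and-stonith plus the crm-fence-peer handler) rather than by
pacemaker.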
Re: [Pacemaker] unknown third node added to a 2 node cluster?
On 8 Oct 2014, at 2:09 am, Brian J. Murrell (brian) <br...@interlinx.bc.ca> wrote:

> Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5 and
> node6, I saw an unknown third node being added to the cluster, but only
> on node5:

Is either node using dhcp? I would guess node6 got a new IP address (or
that corosync decided to bind to a different one).

> [...]
>
> Any idea what's going on here?
>
> Cheers,
> b.
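If a wandering address is indeed the cause, pinning both the network and
the node ids in corosync.conf keeps a re-bound interface from showing up
as a brand-new member. A sketch for corosync 1.x with the pacemaker
plugin -- the addresses and id are placeholders for the real ring0
values:

    totem {
        version: 2
        # A fixed node id stops corosync from deriving a new id when
        # the bound address changes.
        nodeid: 5
        interface {
            ringnumber: 0
            # Network (not host) address of the ring0 subnet.
            bindnetaddr: 10.0.0.0
            mcastaddr: 226.94.1.1
            mcastport: 5405
        }
    }

Static addresses (or DHCP reservations) on both ring interfaces remove
the underlying trigger as well.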