Re: [Pacemaker] Fencing of movable VirtualDomains

2014-10-07 Thread Daniel Dehennin
Andrew Beekhof and...@beekhof.net writes:

 Maybe not, the colocation should be sufficient, but even without the
 order constraints, fencing of unclean VMs is attempted with other STONITH devices.

 Which other devices?  The config you sent through didn't have any
 others.

Sorry, I sent it to the linux-cluster mailing list but not here; I'm attaching it.

 I'll switch to newer corosync/pacemaker and use the pacemaker_remote if
 I can manage dlm/cLVM/OCFS2 with it.

 No can do.  All three services require corosync on the node. 

OK, so pacemaker_remote is useless in my case, but upgrading seems required[1]
since the wheezy software stack looks too old.

Thanks.

Footnotes: 
[1]  http://article.gmane.org/gmane.linux.redhat.cluster/22963

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF

node nebula1
node nebula2
node nebula3
node one
node quorum \
attributes standby=on
primitive ONE-Frontend ocf:heartbeat:VirtualDomain \
params config=/var/lib/one/datastores/one/one.xml \
op start interval=0 timeout=90 \
op stop interval=0 timeout=100 \
meta target-role=Stopped
primitive ONE-OCFS2-datastores ocf:heartbeat:Filesystem \
params device=/dev/one-fs/datastores directory=/var/lib/one/datastores fstype=ocfs2 \
op start interval=0 timeout=90 \
op stop interval=0 timeout=100 \
op monitor interval=20 timeout=40
primitive ONE-vg ocf:heartbeat:LVM \
params volgrpname=one-fs \
op start interval=0 timeout=30 \
op stop interval=0 timeout=30 \
op monitor interval=60 timeout=30
primitive Quorum-Node ocf:heartbeat:VirtualDomain \
params config=/var/lib/libvirt/qemu/pcmk/quorum.xml \
op start interval=0 timeout=90 \
op stop interval=0 timeout=100 \
meta target-role=Started
primitive Stonith-ONE-Frontend stonith:external/libvirt \
params hostlist=one hypervisor_uri=qemu:///system pcmk_host_list=one pcmk_host_check=static-list \
op monitor interval=30m \
meta target-role=Started
primitive Stonith-Quorum-Node stonith:external/libvirt \
params hostlist=quorum hypervisor_uri=qemu:///system pcmk_host_list=quorum pcmk_host_check=static-list \
op monitor interval=30m \
meta target-role=Started
primitive Stonith-nebula1-IPMILAN stonith:external/ipmi \
params hostname=nebula1-ipmi ipaddr=A.B.C.D interface=lanplus userid=user passwd=X passwd_method=env priv=operator pcmk_host_list=nebula1 pcmk_host_check=static-list priority=10 \
op monitor interval=30m \
meta target-role=Started
primitive Stonith-nebula2-IPMILAN stonith:external/ipmi \
params hostname=nebula2-ipmi ipaddr=A.B.C.D interface=lanplus userid=user passwd=X passwd_method=env priv=operator pcmk_host_list=nebula2 pcmk_host_check=static-list priority=20 \
op monitor interval=30m \
meta target-role=Started
primitive Stonith-nebula3-IPMILAN stonith:external/ipmi \
params hostname=nebula3-ipmi ipaddr=A.B.C.D interface=lanplus userid=user passwd=X passwd_method=env priv=operator pcmk_host_list=nebula3 pcmk_host_check=static-list priority=30 \
op monitor interval=30m \
meta target-role=Started
primitive clvm ocf:lvm2:clvm \
op start interval=0 timeout=90 \
op stop interval=0 timeout=90 \
op monitor interval=60 timeout=90
primitive dlm ocf:pacemaker:controld \
op start interval=0 timeout=90 \
op stop interval=0 timeout=100 \
op monitor interval=60 timeout=60
primitive o2cb ocf:pacemaker:o2cb \
params stack=pcmk daemon_timeout=30 \
op start interval=0 timeout=90 \
op stop interval=0 timeout=100 \
op monitor interval=60 timeout=60
group ONE-Storage dlm o2cb clvm ONE-vg ONE-OCFS2-datastores
clone ONE-Storage-Clone ONE-Storage \
meta interleave=true target-role=Started
location Nebula1-does-not-fence-itslef Stonith-nebula1-IPMILAN \
rule $id=Nebula1-does-not-fence-itslef-rule inf: #uname ne nebula1
location Nebula2-does-not-fence-itslef Stonith-nebula2-IPMILAN \
rule $id=Nebula2-does-not-fence-itslef-rule inf: #uname ne nebula2
location Nebula3-does-not-fence-itslef Stonith-nebula3-IPMILAN \
rule $id=Nebula3-does-not-fence-itslef-rule inf: #uname ne nebula3
location Nodes-with-ONE-Storage ONE-Storage-Clone \
rule $id=Nodes-with-ONE-Storage-rule inf: #uname eq nebula1 or #uname eq nebula2 or #uname eq nebula3 or #uname eq one
location ONE-Fontend-fenced-by-hypervisor Stonith-ONE-Frontend \
rule $id=ONE-Fontend-fenced-by-hypervisor-rule inf: #uname ne quorum or #uname ne one
location ONE-Frontend-run-on-hypervisor ONE-Frontend \
rule $id=ONE-Frontend-run-on-hypervisor-rule 40: #uname eq nebula1 \
rule $id=ONE-Frontend-run-on-hypervisor-rule-0 30: #uname eq nebula2 \
rule 
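[Editorial note for readers with a similar setup: since this cluster mixes per-node IPMI devices with external/libvirt devices for the VMs, the order in which stonithd tries devices can be made explicit with a fencing topology. The following is a hypothetical sketch reusing the resource names from the attached config; it assumes a crmsh/Pacemaker version new enough to support fencing_topology, which the wheezy-era stack discussed above may not be.]

```shell
# Sketch: one explicit fence device per node -- IPMI for the physical
# hosts, external/libvirt for the two virtual machines.
crm configure fencing_topology \
    nebula1: Stonith-nebula1-IPMILAN \
    nebula2: Stonith-nebula2-IPMILAN \
    nebula3: Stonith-nebula3-IPMILAN \
    one: Stonith-ONE-Frontend \
    quorum: Stonith-Quorum-Node
```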

[Pacemaker] unknown third node added to a 2 node cluster?

2014-10-07 Thread Brian J. Murrell (brian)
Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5
and node6 I saw an unknown third node being added to the cluster,
but only on node5:

Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 12: memb=2, new=0, lost=0
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: memb: 
node6 3713011210
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: memb: 
node5 3729788426
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 12: memb=3, new=1, lost=0
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: update_member: Creating 
entry for node 2085752330 born on 12
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: update_member: Node 
2085752330/unknown is now: member
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: NEW:  
.pending. 2085752330
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node6 3713011210
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node5 3729788426
Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
.pending. 2085752330

Above is where this third node seems to appear.

Sep 18 22:52:16 node5 corosync[17321]:   [pcmk  ] info: 
send_member_notification: Sending membership update 12 to 2 children
Sep 18 22:52:16 node5 corosync[17321]:   [TOTEM ] A processor joined or left 
the membership and a new membership was formed.
Sep 18 22:52:16 node5 cib[17371]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node (null)[2085752330] - state is now member (was 
(null))
Sep 18 22:52:16 node5 crmd[17376]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node (null)[2085752330] - state is now member (was 
(null))
Sep 18 22:52:16 node5 crmd[17376]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Sep 18 22:52:16 node5 crmd[17376]:error: join_make_offer: No recipient for 
welcome message
Sep 18 22:52:16 node5 crmd[17376]:  warning: do_state_transition: Only 2 of 3 
cluster nodes are eligible to run resources - continue 0
Sep 18 22:52:16 node5 attrd[17374]:   notice: attrd_local_callback: Sending 
full refresh (origin=crmd)
Sep 18 22:52:16 node5 attrd[17374]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: probe_complete (true)
Sep 18 22:52:16 node5 stonith-ng[17372]:   notice: unpack_config: On loss of 
CCM Quorum: Ignore
Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: Diff: --- 0.31.2
Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: Diff: +++ 0.32.1 
4a679012144955c802557a39707247a2
Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: --   <nvpair 
value="Stopped" id="res1-meta_attributes-target-role"/>
Sep 18 22:52:16 node5 cib[17371]:   notice: cib:diff: ++   <nvpair 
name="target-role" id="res1-meta_attributes-target-role" value="Started"/>
Sep 18 22:52:16 node5 pengine[17375]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Sep 18 22:52:16 node5 pengine[17375]:   notice: LogActions: Start   
res1#011(node5)
Sep 18 22:52:16 node5 crmd[17376]:   notice: te_rsc_command: Initiating action 
7: start res1_start_0 on node5 (local)
Sep 18 22:52:16 node5 pengine[17375]:   notice: process_pe_message: Calculated 
Transition 22: /var/lib/pacemaker/pengine/pe-input-165.bz2
Sep 18 22:52:16 node5 stonith-ng[17372]:   notice: stonith_device_register: 
Device 'st-fencing' already existed in device list (1 active devices)

On node6 at the same time the following was in the log:

Sep 18 22:52:15 node6 corosync[11178]:   [TOTEM ] Incrementing problem counter 
for seqid 5 iface 10.128.0.221 to [1 of 10]
Sep 18 22:52:16 node6 corosync[11178]:   [TOTEM ] Incrementing problem counter 
for seqid 8 iface 10.128.0.221 to [2 of 10]
Sep 18 22:52:17 node6 corosync[11178]:   [TOTEM ] Decrementing problem counter 
for iface 10.128.0.221 to [1 of 10]
Sep 18 22:52:19 node6 corosync[11178]:   [TOTEM ] ring 1 active with no faults

Any idea what's going on here?

Cheers,
b.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] running arbitrary script when resource fails

2014-10-07 Thread Alex Samad - Yieldbroker
thanks

-Original Message-
From: Ken Gaillot [mailto:kjgai...@gleim.com] 
Sent: Tuesday, 7 October 2014 7:24 AM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] running arbitrary script when resource fails

On 10/06/2014 06:20 AM, Alex Samad - Yieldbroker wrote:
 Is it possible to do this ?

 Or even on any major fail, I would like to send a signal to my zabbix 
 server

 Alex

Hi Alex,

This sort of thing has been discussed before, for example see 
http://oss.clusterlabs.org/pipermail/pacemaker/2014-August/022418.html

At Gleim, we use an active monitoring approach -- instead of waiting for a 
notification, our monitor polls the cluster regularly. In our case, we're using 
the check_crm nagios plugin available at 
https://github.com/dnsmichi/icinga-plugins/blob/master/scripts/check_crm. It's 
a fairly simple Perl script utilizing crm_mon, so you could probably tweak the 
output to fit something zabbix expects, if there isn't an equivalent for zabbix 
already.

And of course you can configure zabbix to monitor the services running on the 
cluster as well.
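[Editorial note: if you do want to push cluster state to Zabbix rather than poll it, the same crm_mon output that check_crm parses can be mapped to a numeric item for zabbix_sender. A minimal sketch follows; the Zabbix server name and item key are invented, and it assumes crm_mon's Nagios-style one-line `--simple-status` output, which starts with "CLUSTER OK" when healthy.]

```shell
#!/bin/sh
# Map crm_mon --simple-status output to 0/1 for a monitoring item.
status_to_value() {
    case "$1" in
        "CLUSTER OK"*) echo 0 ;;   # cluster reports healthy
        *)             echo 1 ;;   # anything else: raise an alert
    esac
}

# Hypothetical usage (host and key names are made up):
# value=$(status_to_value "$(crm_mon --simple-status 2>&1)")
# zabbix_sender -z zabbix.example.com -s "$(hostname)" \
#               -k pacemaker.status -o "$value"
```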

-- Ken Gaillot kjgai...@gleim.com
Network Operations Center, Gleim Publications



Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-07 Thread Felix Zachlod

Hello Andrew,

Am 06.10.2014 04:30, schrieb Andrew Beekhof:


On 3 Oct 2014, at 5:07 am, Felix Zachlod fz.li...@sis-gmbh.info wrote:


Am 02.10.2014 18:02, schrieb Digimer:

On 02/10/14 02:44 AM, Felix Zachlod wrote:

I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7


Please upgrade to 1.1.10+!



Are you referring to a specific bug or code change? I normally don't like 
building all this stuff from source rather than using the packages unless 
there are very good reasons for it. I have been running some 1.1.7 
Debian-based Pacemaker clusters for a long time now without any issue, and 
this version seems to run very stably, so as long as I am not facing a 
specific problem with this version

According to git, there are 1143 specific problems with 1.1.7.
In total there have been 3815 commits and 5 releases in the last 2.5 years; we 
don't do all that for fun :-)


I know that there have been a lot of changes since this ancient version. 
But I was just curious whether there was something specific that might be 
related to my problem. I work closely with software development in our 
company, so I know that newer does not automatically mean fewer bugs, or 
especially fewer bugs concerning ME. That's why I suspect installing the 
most recent version would be trial and error, which might help in some 
cases but does not explain the underlying problem in any way.



On the other hand, if both sides think they have up-to-date data it might not 
be anything to do with pacemaker at all.


That is what I suspect too, and why I passed this question to the drbd 
mailing list. I am now nearly totally convinced that pacemaker isn't 
doing anything wrong here, because the drbd RA sets a master score of 1000 
on either side, which according to my constraints is the signal for 
pacemaker to promote.
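[Editorial note: for anyone else chasing the same initial split-brain symptom with dual-primary DRBD under Pacemaker, the usual suspects are the DRBD-side fencing and after-split-brain policies rather than the promotion logic. A sketch of the commonly recommended 8.4-era options follows; the resource name is illustrative, so check each option against the drbd.conf manual for your version.]

```
resource r0 {
    net {
        protocol C;
        allow-two-primaries yes;
        # auto-recovery policies applied after a split brain is detected
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    disk {
        # suspend I/O and call the fence-peer handler on replication loss
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}
```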


regards, Felix



Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-07 Thread Andrew Beekhof

On 8 Oct 2014, at 9:20 am, Felix Zachlod fz.li...@sis-gmbh.info wrote:

 Hello Andrew,
 
 Am 06.10.2014 04:30, schrieb Andrew Beekhof:
 
 On 3 Oct 2014, at 5:07 am, Felix Zachlod fz.li...@sis-gmbh.info wrote:
 
 Am 02.10.2014 18:02, schrieb Digimer:
 On 02/10/14 02:44 AM, Felix Zachlod wrote:
 I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7
 
 Please upgrade to 1.1.10+!
 
 
 Are you referring to a specific bug or code change? I normally don't like 
 building all this stuff from source rather than using the packages unless 
 there are very good reasons for it. I have been running some 1.1.7 
 Debian-based Pacemaker clusters for a long time now without any issue, and 
 this version seems to run very stably, so as long as I am not facing a 
 specific problem with this version
 
 According to git, there are 1143 specific problems with 1.1.7.
 In total there have been 3815 commits and 5 releases in the last 2.5 years; 
 we don't do all that for fun :-)
 
 I know that there have been a lot of changes since this ancient version. But 
 I was just curious whether there was something specific that might be related 
 to my problem. I work closely with software development in our company, so I 
 know that newer does not automatically mean fewer bugs, or especially fewer 
 bugs concerning ME.

Particularly where the policy engine is concerned, it is actually true thanks 
to the 500+ regression tests we have.
Also, there have definitely been improvements to master/slave in the last few 
releases.

Check out the release notes, that's where I try to highlight the more 
interesting/important fixes.

 That's why I suspect installing the most recent version would be trial and 
 error, which might help in some cases but does not explain the underlying 
 problem in any way.
 
 On the other hand, if both sides think they have up-to-date data it might 
 not be anything to do with pacemaker at all.
 
 That is what I suspect too, and why I passed this question to the drbd 
 mailing list. I am now nearly totally convinced that pacemaker isn't doing 
 anything wrong here, because the drbd RA sets a master score of 1000 on 
 either side, which according to my constraints is the signal for pacemaker 
 to promote.
 
 regards, Felix
 





Re: [Pacemaker] unknown third node added to a 2 node cluster?

2014-10-07 Thread Andrew Beekhof

On 8 Oct 2014, at 2:09 am, Brian J. Murrell (brian) br...@interlinx.bc.ca 
wrote:

 Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5
 and node6 I saw an unknown third node being added to the cluster,
 but only on node5:

Is either node using dhcp?
I would guess node6 got a new IP address (or that corosync decided to bind to a 
different one)
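[Editorial note: one way to sanity-check that guess is to decode the mystery id. With the corosync 1.x plugin stack, an auto-generated nodeid is the ring0 IPv4 address packed into a 32-bit integer (network byte order read as a little-endian number). The snippet below rests on that encoding assumption, so treat the decoded address as a hint, not proof.]

```shell
# Decode a corosync auto-generated nodeid back into dotted-quad form.
nodeid_to_ip() {
    n=$1
    printf '%d.%d.%d.%d\n' \
        $(( n & 255 )) $(( (n >> 8) & 255 )) \
        $(( (n >> 16) & 255 )) $(( n >> 24 ))
}

nodeid_to_ip 2085752330   # the unexpected .pending. member -> 10.14.82.124
```

If the decoded address is on the same subnet as the known nodes but unassigned to either, that supports the theory of a re-bound or DHCP-changed interface.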

 


