Re: [ClusterLabs] Cluster node getting stopped from other node (resending mail)
On 06/30/2015 11:30 PM, Arjun Pandey wrote:
> Hi
>
> I am running a 2-node cluster with this config on CentOS 6.5/6.6:
>
>  Master/Slave Set: foo-master [foo]
>      Masters: [ messi ]
>      Stopped: [ ronaldo ]
>  eth1-CP   (ocf::pw:IPaddr): Started messi
>  eth2-UP   (ocf::pw:IPaddr): Started messi
>  eth3-UPCP (ocf::pw:IPaddr): Started messi
>
> where I have a multi-state resource foo running in master/slave mode, and the IPaddr RA is just a modified IPaddr2 RA. Additionally I have a colocation constraint for the IP addresses to be colocated with the master.
>
> Sometimes when I set up the cluster, I find that one of the nodes (the second node that joins) gets stopped, and I find this log:
>
> 2015-06-01T13:55:46.153941+05:30 ronaldo pacemaker: Starting Pacemaker Cluster Manager
> 2015-06-01T13:55:46.233639+05:30 ronaldo attrd[25988]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
> 2015-06-01T13:55:46.234162+05:30 ronaldo crmd[25990]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> 2015-06-01T13:55:46.234701+05:30 ronaldo attrd[25988]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> 2015-06-01T13:55:46.234708+05:30 ronaldo attrd[25988]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
>
> This looks to be the likely reason:
>
> 2015-06-01T13:55:46.254310+05:30 ronaldo crmd[25990]: error: handle_request: We didn't ask to be shut down, yet our DC is telling us too.

Hi Arjun,

I'd check the other node's logs at this time, to see why it requested the shutdown.

> 2015-06-01T13:55:46.254577+05:30 ronaldo crmd[25990]: notice: do_state_transition: State transition S_NOT_DC -> S_STOPPING [ input=I_STOP cause=C_HA_MESSAGE origin=route_message ]
> 2015-06-01T13:55:46.255134+05:30 ronaldo crmd[25990]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown... waiting (2 ops remaining)
>
> Based on the logs, pacemaker on the active node was stopping the secondary cloud every time it joined the cluster. This issue seems similar to
> http://pacemaker.oss.clusterlabs.narkive.com/rVvN8May/node-sends-shutdown-request-to-other-node-error
>
> Packages used:
> pacemaker-1.1.12-4.el6.x86_64
> pacemaker-libs-1.1.12-4.el6.x86_64
> pacemaker-cli-1.1.12-4.el6.x86_64
> pacemaker-cluster-libs-1.1.12-4.el6.x86_64
> pacemaker-debuginfo-1.1.10-14.el6.x86_64
> pcsc-lite-libs-1.5.2-13.el6_4.x86_64
> pcs-0.9.90-2.el6.centos.2.noarch
> pcsc-lite-1.5.2-13.el6_4.x86_64
> pcsc-lite-openct-0.6.19-4.el6.x86_64
> corosync-1.4.1-17.el6.x86_64
> corosynclib-1.4.1-17.el6.x86_64
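For context, the kind of constraint Arjun describes (IPs colocated with the master role) would look roughly like the crm shell sketch below. The resource names are taken from the status output above; this is only an illustration of the constraint being described, not Arjun's actual configuration:

    # illustrative only: keep each IPaddr resource on whichever node holds the foo master
    colocation eth1-CP-with-master inf: eth1-CP foo-master:Master
    colocation eth2-UP-with-master inf: eth2-UP foo-master:Master
    colocation eth3-UPCP-with-master inf: eth3-UPCP foo-master:Master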
Re: [ClusterLabs] Resource stops when another resource runs on that node
On 07/01/2015 01:18 AM, John Gogu wrote:
> Hello,
> this is what I have set up, but it is not working 100%:
>
> Online: [ node01hb0 node02hb0 ]
>
> Full list of resources:
>  IP1_Vir (ocf::heartbeat:IPaddr): Started node01hb0
>  IP2_Vir (ocf::heartbeat:IPaddr): Started node02hb0
>
> default-resource-stickiness: 2000
>
> Location Constraints:
>   Resource: IP1_Vir
>     Enabled on: node01hb0 (score:1000)
>   Resource: IP2_Vir
>     Disabled on: node01hb0 (score:-INFINITY)
> Colocation Constraints:
>   IP2_Vir with IP1_Vir (score:-INFINITY)
>
> When I manually move the resource IP1_Vir from node01hb0 to node02hb0, all is fine, and IP2_Vir is stopped.

That's what you asked it to do. :)

The -INFINITY constraint for IP2_Vir on node01hb0 means that IP2_Vir can *never* run on that node. The -INFINITY constraint for IP2_Vir with IP1_Vir means that IP2_Vir can *never* run on the same node as IP1_Vir. So if IP1_Vir is on node02hb0, then IP2_Vir has nowhere to run.

If you want either node to be able to take over either IP when necessary, you don't want any -INFINITY constraints. You can use a score other than -INFINITY to give a preference instead of a requirement.

For example, if you want the IPs to run on different nodes whenever possible, you could have a colocation constraint IP2_Vir with IP1_Vir with a score of -3000. Having the score more negative than the resource stickiness means that when a failed node comes back up, one of the IPs will move to it. If you don't want that, use a score whose magnitude is less than your stickiness, such as -100.

You probably don't want any location constraints, unless there's a reason each IP should be on a specific node in normal operation.

> When I crash node node01hb0 / stop pacemaker, both resources are stopped.

This likely depends on your quorum and fencing configuration, and what versions of software you're using.
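To make the suggestion above concrete, a rough pcs sketch follows (resource names taken from the output above; the constraint IDs to remove are whatever your own cluster shows, and command forms can vary slightly between pcs versions):

    # list current constraints with their IDs
    pcs constraint show --full

    # remove the two -INFINITY constraints by their IDs
    pcs constraint remove <location-constraint-id> <colocation-constraint-id>

    # prefer, but do not require, that the two IPs run on different nodes;
    # -3000 outweighs the 2000 stickiness, so an IP moves back when a failed node returns
    pcs constraint colocation add IP2_Vir with IP1_Vir -3000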
Re: [ClusterLabs] Pacemaker failover failure
So I did another test.

Two nodes: node1 and node2.
Case: node1 is the active node, node2 is passive.

If I run "killall -9 pacemakerd corosync" on node1, the services do not fail over to node2; but if I then start corosync and pacemaker on node1 again, they do fail over to node2. Where am I going wrong?

Alex

On Wed, Jul 1, 2015 at 12:42 PM, alex austin alexixa...@gmail.com wrote:
> Hi all,
>
> I have configured a virtual IP and redis in master-slave with corosync/pacemaker. If redis fails, then the failover is successful, and redis gets promoted on the other node. However, if pacemaker itself fails on the active node, the failover is not performed. Is there anything I missed in the configuration?
>
> Here's my configuration (I have hashed the IP address out):
>
> node host1.com
> node host2.com
> primitive ClusterIP IPaddr2 \
>     params ip=xxx.xxx.xxx.xxx cidr_netmask=23 \
>     op monitor interval=1s timeout=20s \
>     op start interval=0 timeout=20s \
>     op stop interval=0 timeout=20s \
>     meta is-managed=true target-role=Started resource-stickiness=500
> primitive redis redis \
>     meta target-role=Master is-managed=true \
>     op monitor interval=1s role=Master timeout=5s on-fail=restart
> ms redis_clone redis \
>     meta notify=true is-managed=true ordered=false interleave=false globally-unique=false target-role=Master migration-threshold=1
> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
> colocation ip-on-redis inf: ClusterIP redis_clone:Master
> property cib-bootstrap-options: \
>     dc-version=1.1.11-97629de \
>     cluster-infrastructure="classic openais (with plugin)" \
>     expected-quorum-votes=2 \
>     stonith-enabled=false
> property redis_replication: \
>     redis_REPL_INFO=host.com
>
> thank you in advance
>
> Kind regards,
> Alex
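For anyone reproducing this, the behaviour can be watched from the surviving node with the standard Pacemaker status tool; a minimal sketch of the test, using the node names above:

    # on node1 (the active node): kill the whole cluster stack
    killall -9 pacemakerd corosync

    # on node2 (the survivor): one-shot status with node attributes, fail counts and inactive resources
    crm_mon -1 -A -f -r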
Re: [ClusterLabs] Pacemaker failover failure
On 07/01/2015 08:57 AM, alex austin wrote:
> I have now configured stonith-enabled=true. What device should I use for fencing, given that it's a virtual machine but I don't have access to its configuration? Would fence_pcmk do? If so, what parameters should I configure for it to work properly?

No, fence_pcmk is not for use in pacemaker, but for use in RHEL 6's CMAN to redirect its fencing requests to pacemaker.

For a virtual machine, ideally you'd use fence_virtd running on the physical host, but I'm guessing from your comment that you can't do that. Does whoever provides your VM also provide an API for controlling it (starting/stopping/rebooting)?

Regarding your original problem, it sounds like the surviving node doesn't have quorum. What version of corosync are you using? If you're using corosync 2, you need "two_node: 1" in corosync.conf, in addition to configuring fencing in pacemaker.

> This is my new config:
>
> node dcwbpvmuas004.edc.nam.gm.com \
>     attributes standby=off
> node dcwbpvmuas005.edc.nam.gm.com \
>     attributes standby=off
> primitive ClusterIP IPaddr2 \
>     params ip=198.208.86.242 cidr_netmask=23 \
>     op monitor interval=1s timeout=20s \
>     op start interval=0 timeout=20s \
>     op stop interval=0 timeout=20s \
>     meta is-managed=true target-role=Started resource-stickiness=500
> primitive pcmk-fencing stonith:fence_pcmk \
>     params pcmk_host_list="dcwbpvmuas004.edc.nam.gm.com dcwbpvmuas005.edc.nam.gm.com" \
>     op monitor interval=10s \
>     meta target-role=Started
> primitive redis redis \
>     meta target-role=Master is-managed=true \
>     op monitor interval=1s role=Master timeout=5s on-fail=restart
> ms redis_clone redis \
>     meta notify=true is-managed=true ordered=false interleave=false globally-unique=false target-role=Master migration-threshold=1
> colocation ClusterIP-on-redis inf: ClusterIP redis_clone:Master
> colocation ip-on-redis inf: ClusterIP redis_clone:Master
> colocation pcmk-fencing-on-redis inf: pcmk-fencing redis_clone:Master
> property cib-bootstrap-options: \
>     dc-version=1.1.11-97629de \
>     cluster-infrastructure="classic openais (with plugin)" \
>     expected-quorum-votes=2 \
>     stonith-enabled=true
> property redis_replication: \
>     redis_REPL_INFO=dcwbpvmuas005.edc.nam.gm.com
>
> On Wed, Jul 1, 2015 at 2:53 PM, Nekrasov, Alexander alexander.nekra...@emc.com wrote:
>> stonith-enabled=false: this might be the issue. The way peer node death is resolved, the surviving node must call STONITH on the peer. If it's disabled, it might not be able to resolve the event.
>>
>> Alex
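For reference, the two_node setting Ken mentions is a corosync 2 votequorum option; a minimal corosync.conf sketch is below. (It does not apply to the corosync 1.x plugin stack that turns out to be in use later in this thread.)

    quorum {
        # corosync 2 only: votequorum with special two-node handling
        provider: corosync_votequorum
        two_node: 1
    }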
Re: [ClusterLabs] Pacemaker failover failure
So I noticed that if I kill redis on one node, it starts on the other, no problem, but if I actually kill pacemaker itself on one node, the other doesn't sense it, so it doesn't fail over.

On Wed, Jul 1, 2015 at 12:42 PM, alex austin alexixa...@gmail.com wrote:
> Hi all,
>
> I have configured a virtual IP and redis in master-slave with corosync/pacemaker. If redis fails, then the failover is successful, and redis gets promoted on the other node. However, if pacemaker itself fails on the active node, the failover is not performed. Is there anything I missed in the configuration?
>
> thank you in advance
>
> Kind regards,
> Alex
Re: [ClusterLabs] Pacemaker failover failure
I am running version 1.4.7 of corosync.

On Wed, Jul 1, 2015 at 3:25 PM, Ken Gaillot kgail...@redhat.com wrote:
> Regarding your original problem, it sounds like the surviving node doesn't have quorum. What version of corosync are you using? If you're using corosync 2, you need "two_node: 1" in corosync.conf, in addition to configuring fencing in pacemaker.
Re: [ClusterLabs] Pacemaker failover failure
On 07/01/2015 09:39 AM, alex austin wrote:
> This is what crm_mon shows:
>
> Last updated: Wed Jul 1 10:35:40 2015
> Last change: Wed Jul 1 09:52:46 2015
> Stack: classic openais (with plugin)
> Current DC: host2 - partition with quorum
> Version: 1.1.11-97629de
> 2 Nodes configured, 2 expected votes
> 4 Resources configured
>
> Online: [ host1 host2 ]
>
> ClusterIP (ocf::heartbeat:IPaddr2): Started host2
> Master/Slave Set: redis_clone [redis]
>     Masters: [ host2 ]
>     Slaves: [ host1 ]
> pcmk-fencing (stonith:fence_pcmk): Started host2
>
> On Wed, Jul 1, 2015 at 3:37 PM, alex austin alexixa...@gmail.com wrote:
>> I am running version 1.4.7 of corosync.

If you can't upgrade to corosync 2 (which has many improvements), you'll need to set the no-quorum-policy=ignore cluster option.

Proper fencing is necessary to avoid a split-brain situation, which can corrupt your data.
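A minimal sketch of setting that option with the crm shell already used for this configuration (corosync 1.x with the plugin has no two_node equivalent, so pacemaker itself must be told to keep running resources when quorum is lost in a two-node cluster):

    crm configure property no-quorum-policy=ignore

As noted above, this only makes sense together with working fencing; without it, a split brain can corrupt data.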