So, I'm migrating my working pacemaker configuration from 1.1.7 to 1.1.10 and am finding what appears to be a new behavior in 1.1.10.
If a given node is running a fencing resource and that node goes AWOL, it needs to be fenced (of course). But any other node trying to take over the fencing resource to fence it appears to first want the current owner of the fencing resource to fence the node. Of course that can't happen since the node that needs to do the fencing is AWOL.

So while I can buy into the general policy that a node needs to be fenced in order to take over its resources, fencing resources have to be excepted from this or there can be this catch-22.

I believe that is how things were working in 1.1.7 but now that I'm on 1.1.10[-1.el6_4.4] this no longer seems to be the case. Or perhaps there is some additional configuration that 1.1.10 needs to effect this behavior.

Here is my configuration:

Cluster Name:
Corosync Nodes:

Pacemaker Nodes:
 host1 host2

Resources:
 Resource: rsc1 (class=ocf provider=foo type=Target)
  Attributes: target=111bad0a-a86a-40e3-b056-c5c93168aa0d
  Meta Attrs: target-role=Started
  Operations: monitor interval=5 timeout=60 (rsc1-monitor-5)
              start interval=0 timeout=300 (rsc1-start-0)
              stop interval=0 timeout=300 (rsc1-stop-0)
 Resource: rsc2 (class=ocf provider=chroma type=Target)
  Attributes: target=a8efa349-4c73-4efc-90d3-d6be7d73c515
  Meta Attrs: target-role=Started
  Operations: monitor interval=5 timeout=60 (rsc2-monitor-5)
              start interval=0 timeout=300 (rsc2-start-0)
              stop interval=0 timeout=300 (rsc2-stop-0)

Stonith Devices:
 Resource: st-fencing (class=stonith type=fence_foo)
Fencing Levels:

Location Constraints:
  Resource: rsc1
    Enabled on: host1 (score:20) (id:rsc1-primary)
    Enabled on: host2 (score:10) (id:rsc1-secondary)
  Resource: rsc2
    Enabled on: host2 (score:20) (id:rsc2-primary)
    Enabled on: host1 (score:10) (id:rsc2-secondary)
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: classic openais (with plugin)
 dc-version: 1.1.10-1.el6_4.4-368c726
 expected-quorum-votes: 2
 no-quorum-policy: ignore
 stonith-enabled: true
 symmetric-cluster: true

One thing that PCS is not showing that might be relevant here is that I have a resource stickiness value set to 1000 to prevent resources from failing back to nodes after a failover.

With the above configuration, if host1 is shut down, host2 just spins in a loop doing:

Dec 2 20:00:02 host2 pengine[8923]: warning: pe_fence_node: Node host1 will be fenced because the node is no longer part of the cluster
Dec 2 20:00:02 host2 pengine[8923]: warning: determine_online_status: Node host1 is unclean
Dec 2 20:00:02 host2 pengine[8923]: warning: custom_action: Action st-fencing_stop_0 on host1 is unrunnable (offline)
Dec 2 20:00:02 host2 pengine[8923]: warning: custom_action: Action rsc1_stop_0 on host1 is unrunnable (offline)
Dec 2 20:00:02 host2 pengine[8923]: warning: stage6: Scheduling Node host1 for STONITH
Dec 2 20:00:02 host2 pengine[8923]: notice: LogActions: Move st-fencing#011(Started host1 -> host2)
Dec 2 20:00:02 host2 pengine[8923]: notice: LogActions: Move rsc1#011(Started host1 -> host2)
Dec 2 20:00:02 host2 crmd[8924]: notice: te_fence_node: Executing reboot fencing operation (13) on host1 (timeout=60000)
Dec 2 20:00:02 host2 stonith-ng[8920]: notice: handle_request: Client crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec 2 20:00:02 host2 stonith-ng[8920]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: ad69ead5-0bbb-45d8-bb07-30bcd405ace2 (0)
Dec 2 20:00:02 host2 pengine[8923]: warning: process_pe_message: Calculated Transition 22: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec 2 20:01:14 host2 stonith-ng[8920]: error: remote_op_done: Operation reboot of host1 by host2 for crmd.8924@host2.ad69ead5: Timer expired
Dec 2 20:01:14 host2 crmd[8924]: notice: tengine_stonith_callback: Stonith operation 4/13:22:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)
Dec 2 20:01:14 host2 crmd[8924]: notice: tengine_stonith_callback: Stonith operation 4 for host1 failed (Timer expired): aborting transition.
Dec 2 20:01:14 host2 crmd[8924]: notice: tengine_stonith_notify: Peer host1 was not terminated (reboot) by host2 for host2: Timer expired (ref=ad69ead5-0bbb-45d8-bb07-30bcd405ace2) by client crmd.8924
Dec 2 20:01:14 host2 crmd[8924]: notice: run_graph: Transition 22 (Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec 2 20:01:14 host2 pengine[8923]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 2 20:01:14 host2 pengine[8923]: warning: pe_fence_node: Node host1 will be fenced because the node is no longer part of the cluster
Dec 2 20:01:14 host2 pengine[8923]: warning: determine_online_status: Node host1 is unclean
Dec 2 20:01:14 host2 pengine[8923]: warning: custom_action: Action st-fencing_stop_0 on host1 is unrunnable (offline)
Dec 2 20:01:14 host2 pengine[8923]: warning: custom_action: Action rsc1_stop_0 on host1 is unrunnable (offline)
Dec 2 20:01:14 host2 pengine[8923]: warning: stage6: Scheduling Node host1 for STONITH
Dec 2 20:01:14 host2 pengine[8923]: notice: LogActions: Move st-fencing#011(Started host1 -> host2)
Dec 2 20:01:14 host2 pengine[8923]: notice: LogActions: Move rsc1#011(Started host1 -> host2)
Dec 2 20:01:14 host2 pengine[8923]: warning: process_pe_message: Calculated Transition 23: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec 2 20:01:14 host2 crmd[8924]: notice: te_fence_node: Executing reboot fencing operation (13) on host1 (timeout=60000)
Dec 2 20:01:14 host2 stonith-ng[8920]: notice: handle_request: Client crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec 2 20:01:14 host2 stonith-ng[8920]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: 4c3f947b-12a7-4b6f-84a9-c5ddcbeb55c6 (0)
Dec 2 20:02:26 host2 stonith-ng[8920]: error: remote_op_done: Operation reboot of host1 by host2 for crmd.8924@host2.4c3f947b: Timer expired
Dec 2 20:02:26 host2 crmd[8924]: notice: tengine_stonith_callback: Stonith operation 5/13:23:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)
Dec 2 20:02:26 host2 crmd[8924]: notice: tengine_stonith_callback: Stonith operation 5 for host1 failed (Timer expired): aborting transition.
Dec 2 20:02:26 host2 crmd[8924]: notice: tengine_stonith_notify: Peer host1 was not terminated (reboot) by host2 for host2: Timer expired (ref=4c3f947b-12a7-4b6f-84a9-c5ddcbeb55c6) by client crmd.8924
Dec 2 20:02:26 host2 crmd[8924]: notice: run_graph: Transition 23 (Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec 2 20:02:26 host2 pengine[8923]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 2 20:02:26 host2 pengine[8923]: warning: pe_fence_node: Node host1 will be fenced because the node is no longer part of the cluster
Dec 2 20:02:26 host2 pengine[8923]: warning: determine_online_status: Node host1 is unclean
Dec 2 20:02:26 host2 pengine[8923]: warning: custom_action: Action st-fencing_stop_0 on host1 is unrunnable (offline)
Dec 2 20:02:26 host2 pengine[8923]: warning: custom_action: Action rsc1_stop_0 on host1 is unrunnable (offline)
Dec 2 20:02:26 host2 pengine[8923]: warning: stage6: Scheduling Node host1 for STONITH
Dec 2 20:02:26 host2 pengine[8923]: notice: LogActions: Move st-fencing#011(Started host1 -> host2)
Dec 2 20:02:26 host2 pengine[8923]: notice: LogActions: Move rsc1#011(Started host1 -> host2)
Dec 2 20:02:26 host2 crmd[8924]: notice: te_fence_node: Executing reboot fencing operation (13) on host1 (timeout=60000)
Dec 2 20:02:26 host2 stonith-ng[8920]: notice: handle_request: Client crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec 2 20:02:26 host2 stonith-ng[8920]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: 4b9c1ffc-3029-4b6a-8128-63c05f0ef8de (0)
Dec 2 20:02:26 host2 pengine[8923]: warning: process_pe_message: Calculated Transition 24: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec 2 20:03:38 host2 stonith-ng[8920]: error: remote_op_done: Operation reboot of host1 by host2 for crmd.8924@host2.4b9c1ffc: Timer expired
Dec 2 20:03:38 host2 crmd[8924]: notice: tengine_stonith_callback: Stonith operation 6/13:24:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)
Dec 2 20:03:38 host2 crmd[8924]: notice: tengine_stonith_callback: Stonith operation 6 for host1 failed (Timer expired): aborting transition.
Dec 2 20:03:38 host2 crmd[8924]: notice: tengine_stonith_notify: Peer host1 was not terminated (reboot) by host2 for host2: Timer expired (ref=4b9c1ffc-3029-4b6a-8128-63c05f0ef8de) by client crmd.8924
Dec 2 20:03:38 host2 crmd[8924]: notice: run_graph: Transition 24 (Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec 2 20:03:38 host2 pengine[8923]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 2 20:03:38 host2 pengine[8923]: warning: pe_fence_node: Node host1 will be fenced because the node is no longer part of the cluster
Dec 2 20:03:38 host2 pengine[8923]: warning: determine_online_status: Node host1 is unclean
Dec 2 20:03:38 host2 pengine[8923]: warning: custom_action: Action st-fencing_stop_0 on host1 is unrunnable (offline)
Dec 2 20:03:38 host2 pengine[8923]: warning: custom_action: Action rsc1_stop_0 on host1 is unrunnable (offline)
Dec 2 20:03:38 host2 pengine[8923]: warning: stage6: Scheduling Node host1 for STONITH
Dec 2 20:03:38 host2 pengine[8923]: notice: LogActions: Move st-fencing#011(Started host1 -> host2)
Dec 2 20:03:38 host2 pengine[8923]: notice: LogActions: Move rsc1#011(Started host1 -> host2)
Dec 2 20:03:38 host2 crmd[8924]: notice: te_fence_node: Executing reboot fencing operation (13) on host1 (timeout=60000)
Dec 2 20:03:38 host2 stonith-ng[8920]: notice: handle_request: Client crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec 2 20:03:38 host2 stonith-ng[8920]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: 8200c15c-d138-4b0a-b6df-ac6fe6e46ef1 (0)
Dec 2 20:03:38 host2 pengine[8923]: warning: process_pe_message: Calculated Transition 25: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec 2 20:04:50 host2 stonith-ng[8920]: error: remote_op_done: Operation reboot of host1 by host2 for crmd.8924@host2.8200c15c: Timer expired
Dec 2 20:04:50 host2 crmd[8924]: notice: tengine_stonith_callback: Stonith operation 7/13:25:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)
Dec 2 20:04:50 host2 crmd[8924]: notice: tengine_stonith_callback: Stonith operation 7 for host1 failed (Timer expired): aborting transition.
Dec 2 20:04:50 host2 crmd[8924]: notice: tengine_stonith_notify: Peer host1 was not terminated (reboot) by host2 for host2: Timer expired (ref=8200c15c-d138-4b0a-b6df-ac6fe6e46ef1) by client crmd.8924
Dec 2 20:04:50 host2 crmd[8924]: notice: run_graph: Transition 25 (Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec 2 20:04:50 host2 pengine[8923]: notice: unpack_config: On loss of CCM Quorum: Ignore
Dec 2 20:04:50 host2 pengine[8923]: warning: pe_fence_node: Node host1 will be fenced because the node is no longer part of the cluster
Dec 2 20:04:50 host2 pengine[8923]: warning: determine_online_status: Node host1 is unclean
Dec 2 20:04:50 host2 pengine[8923]: warning: custom_action: Action st-fencing_stop_0 on host1 is unrunnable (offline)
Dec 2 20:04:50 host2 pengine[8923]: warning: custom_action: Action rsc1_stop_0 on host1 is unrunnable (offline)
Dec 2 20:04:50 host2 pengine[8923]: warning: stage6: Scheduling Node host1 for STONITH
Dec 2 20:04:50 host2 pengine[8923]: notice: LogActions: Move st-fencing#011(Started host1 -> host2)
Dec 2 20:04:50 host2 pengine[8923]: notice: LogActions: Move rsc1#011(Started host1 -> host2)
Dec 2 20:04:50 host2 pengine[8923]: warning: process_pe_message: Calculated Transition 26: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec 2 20:04:50 host2 crmd[8924]: notice: te_fence_node: Executing reboot fencing operation (13) on host1 (timeout=60000)
Dec 2 20:04:50 host2 stonith-ng[8920]: notice: handle_request: Client crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec 2 20:04:50 host2 stonith-ng[8920]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: 8ceabae8-6876-4d6d-b44c-c64c0863f68c (0)

So is there something new about 1.1.10 that I am missing?

Cheers,
b.
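
P.S. To make the loop in the log easier to see, here is a minimal sketch that pairs each stonith-ng "Initiating remote operation" line with its matching "remote_op_done" line and reports how long the attempt ran. The function and variable names (fencing_attempts, SAMPLE, etc.) are my own and purely illustrative, the regular expressions are only assumed to match the log format shown in this message, and it assumes all timestamps fall on the same day.

```python
import re

# Patterns written against the stonith-ng lines quoted in this message only;
# other Pacemaker versions may log these events differently (assumption).
START = re.compile(
    r"(\d\d):(\d\d):(\d\d) \S+ stonith-ng\[\d+\]:\s+notice: initiate_remote_stonith_op: "
    r"Initiating remote operation \w+ for (\S+): ([0-9a-f]{8})"
)
DONE = re.compile(
    r"(\d\d):(\d\d):(\d\d) \S+ stonith-ng\[\d+\]:\s+error: remote_op_done: "
    r"Operation \w+ of (\S+) by \S+ for \S+\.([0-9a-f]{8}): (.+)$"
)


def to_seconds(hours, minutes, seconds):
    """Convert HH, MM, SS strings to seconds since midnight."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def fencing_attempts(lines):
    """Return (target, op_id_prefix, elapsed_seconds, result) per finished attempt."""
    started = {}   # first 8 hex digits of the operation id -> start time
    attempts = []
    for line in lines:
        match = START.search(line)
        if match:
            started[match.group(5)] = to_seconds(*match.group(1, 2, 3))
            continue
        match = DONE.search(line)
        if match and match.group(5) in started:
            elapsed = to_seconds(*match.group(1, 2, 3)) - started.pop(match.group(5))
            attempts.append((match.group(4), match.group(5), elapsed, match.group(6)))
    return attempts


# Four lines copied from the log above.
SAMPLE = """\
Dec 2 20:00:02 host2 stonith-ng[8920]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: ad69ead5-0bbb-45d8-bb07-30bcd405ace2 (0)
Dec 2 20:01:14 host2 stonith-ng[8920]: error: remote_op_done: Operation reboot of host1 by host2 for crmd.8924@host2.ad69ead5: Timer expired
Dec 2 20:01:14 host2 stonith-ng[8920]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: 4c3f947b-12a7-4b6f-84a9-c5ddcbeb55c6 (0)
Dec 2 20:02:26 host2 stonith-ng[8920]: error: remote_op_done: Operation reboot of host1 by host2 for crmd.8924@host2.4c3f947b: Timer expired
"""

if __name__ == "__main__":
    for target, op_id, elapsed, result in fencing_attempts(SAMPLE.splitlines()):
        print(f"{target} {op_id}: {result} after {elapsed}s")
```

Run against the excerpt, every attempt ends in "Timer expired" 72 seconds after it was initiated, and the next attempt starts in the same second, which is the loop I am describing.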
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org