Re: [Pacemaker] why so long to stonith?
On 26/04/2013, at 10:24 AM, David Coulson wrote:
> On 4/25/13 7:43 PM, Andrew Beekhof wrote:
>> I certainly hope so :)
> So I should complain to our sales people about this BZ before we upgrade
> our clusters to 6.4?

Actually, I'm going to back-track on this. After further investigation, it appears that only plugin-based clusters (i.e. those using corosync.conf) are affected. You won't have any problems if, as recommended, you use cman.
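For anyone unsure which kind of cluster they are running, a rough check (a sketch; paths assume the stock RHEL 6 layout, and the pacemaker service stanza may also live under /etc/corosync/service.d/):

    # Plugin-based clusters declare pacemaker as a corosync service:
    grep -A2 'name: pacemaker' /etc/corosync/corosync.conf

    # cman-based clusters are configured via cluster.conf instead:
    test -f /etc/cluster/cluster.conf && echo 'cman-based (not affected)'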
Re: [Pacemaker] why so long to stonith?
On 26/04/2013, at 10:24 AM, David Coulson wrote:
> On 4/25/13 7:43 PM, Andrew Beekhof wrote:
>> I certainly hope so :)
> So I should complain to our sales people about this BZ before we upgrade
> our clusters to 6.4?

I don't think it would hurt to demonstrate how many people are using it.
Re: [Pacemaker] why so long to stonith?
On 4/25/13 7:43 PM, Andrew Beekhof wrote:
> I certainly hope so :)

So I should complain to our sales people about this BZ before we upgrade our clusters to 6.4?
Re: [Pacemaker] why so long to stonith?
On 25/04/2013, at 5:22 AM, Brian J. Murrell wrote:
> On 13-04-24 01:16 AM, Andrew Beekhof wrote:
>> Almost certainly you are hitting:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=951340
>
> Yup. The patch posted there fixed it.
>
>> I am doing my best to convince the people who make decisions that this
>> is worthy of an update before 6.5.
>
> I've added my voice to the bug, if that's any help.

I certainly hope so :)
Re: [Pacemaker] why so long to stonith?
On 13-04-24 01:16 AM, Andrew Beekhof wrote:
> Almost certainly you are hitting:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=951340

Yup. The patch posted there fixed it.

> I am doing my best to convince the people who make decisions that this
> is worthy of an update before 6.5.

I've added my voice to the bug, if that's any help.

b.
Re: [Pacemaker] why so long to stonith?
On 24/04/2013, at 5:34 AM, Brian J. Murrell wrote:
> Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed
> (-KILL) corosync on a peer node. Pacemaker seemed to take a long time
> to transition to stonithing it though after noticing it was AWOL:

[snip]

> As you can see, 3 minutes and 10 seconds went by before pacemaker
> transitioned from noticing the node unresponsive to stonithing it.
>
> This smacks of some kind of mis-configured timeout but I'm not aware
> of any timeout that would have this effect.
>
> Thoughts?
>
> b.

Almost certainly you are hitting:

    https://bugzilla.redhat.com/show_bug.cgi?id=951340

I am doing my best to convince the people who make decisions that this is worthy of an update before 6.5. The mystery at the moment is why some clusters (i.e. all the ones we tested on internally) seem unaffected.
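To confirm you are seeing the same symptom, the telltale signature in the logs is a multi-minute gap between the peer being declared lost and a policy engine run triggered by a popped timer (C_TIMER_POPPED) rather than by the membership change itself. A rough check (a sketch; the log path assumes the default RHEL 6 syslog configuration):

    # A multi-minute gap between these two events matches the bug:
    grep -E 'state is now lost|C_TIMER_POPPED' /var/log/messages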
Re: [Pacemaker] why so long to stonith?
As I understand it, this is a known issue with the 1.1.8 release. I believe that 1.1.9 is now available from the pacemaker repos and it should fix the problem.

digimer

On 04/23/2013 03:34 PM, Brian J. Murrell wrote:
> Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed
> (-KILL) corosync on a peer node. Pacemaker seemed to take a long time
> to transition to stonithing it though after noticing it was AWOL:

[snip]
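A quick sanity check before and after upgrading (a sketch, assuming the stock RHEL packaging) is to confirm which build is actually installed:

    rpm -q pacemaker        # should report 1.1.9 or later after the upgrade
    pacemakerd --version    # what the running daemons were built from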
[Pacemaker] why so long to stonith?
Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed (-KILL) corosync on a peer node. Pacemaker seemed to take a long time to transition to stonithing it though after noticing it was AWOL:

Apr 23 19:05:20 node2 corosync[1324]: [TOTEM ] A processor failed, forming new configuration.
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 188: memb=1, new=0, lost=1
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: lost: node1 4252674240
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 188: memb=1, new=0, lost=0
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: update_member: Node 4252674240/node1 is now: lost
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 188 to 2 children
Apr 23 19:05:21 node2 corosync[1324]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 23 19:05:21 node2 corosync[1324]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) ; members(old:2 left:1)
Apr 23 19:05:21 node2 corosync[1324]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 23 19:05:21 node2 crmd[1634]: notice: ais_dispatch_message: Membership 188: quorum lost
Apr 23 19:05:21 node2 crmd[1634]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:05:21 node2 cib[1629]: notice: ais_dispatch_message: Membership 188: quorum lost
Apr 23 19:05:21 node2 cib[1629]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:08:31 node2 crmd[1634]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Apr 23 19:08:31 node2 pengine[1633]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 23 19:08:31 node2 pengine[1633]: warning: pe_fence_node: Node node1 will be fenced because the node is no longer part of the cluster
Apr 23 19:08:31 node2 pengine[1633]: warning: determine_online_status: Node node1 is unclean
Apr 23 19:08:31 node2 pengine[1633]: warning: custom_action: Action MGS_e4a31b_stop_0 on node1 is unrunnable (offline)
Apr 23 19:08:31 node2 pengine[1633]: warning: stage6: Scheduling Node node1 for STONITH
Apr 23 19:08:31 node2 pengine[1633]: notice: LogActions: Move MGS_e4a31b#011(Started node1 -> node2)
Apr 23 19:08:31 node2 crmd[1634]: notice: te_fence_node: Executing reboot fencing operation (15) on node1 (timeout=6)
Apr 23 19:08:31 node2 stonith-ng[1630]: notice: handle_request: Client crmd.1634.642b9c6e wants to fence (reboot) 'node1' with device '(any)'
Apr 23 19:08:31 node2 stonith-ng[1630]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node1: fb431eb4-789c-41bc-903e-4041d50e93b4 (0)
Apr 23 19:08:31 node2 pengine[1633]: warning: process_pe_message: Calculated Transition 115: /var/lib/pacemaker/pengine/pe-warn-7.bz2
Apr 23 19:08:41 node2 stonith-ng[1630]: notice: log_operation: Operation 'reboot' [27682] (call 0 from crmd.1634) for host 'node1' with device 'st-node1' returned: 0 (OK)
Apr 23 19:08:41 node2 stonith-ng[1630]: notice: remote_op_done: Operation reboot of node1 by node2 for crmd.1634@node2.fb431eb4: OK
Apr 23 19:08:41 node2 crmd[1634]: notice: tengine_stonith_callback: Stonith operation 3/15:115:0:c118573c-84e3-48bd-8dc9-40de24438385: OK (0)
Apr 23 19:08:41 node2 crmd[1634]: notice: tengine_stonith_notify: Peer node1 was terminated (st_notify_fence) by node2 for node2: OK (ref=fb431eb4-789c-41bc-903e-4041d50e93b4) by client crmd.1634
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 192: memb=1, new=0, lost=0
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 192: memb=2, new=1, lost=0
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: update_member: Node 4252674240/node1 is now: member
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: NEW: node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 192 to 2 children

[snip]

As you can see, 3 minutes and 10 seconds went by before pacemaker transitioned from noticing the node unresponsive to stonithing it.

This smacks of some kind of mis-configured timeout but I'm not aware of any timeout that would have this effect.

Thoughts?

b.
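For anyone wanting to reproduce the test described above, it amounts to hard-killing corosync on one node and timing how long the survivor takes to schedule fencing (a sketch; the node names match the logs above, and the log path assumes the default RHEL 6 syslog configuration):

    # On the node being "failed" (node1): kill corosync without warning.
    killall -KILL corosync

    # On the surviving node (node2): note when node1 is declared lost,
    # then when the "Scheduling Node node1 for STONITH" entry appears.
    grep -E 'state is now lost|for STONITH' /var/log/messages

    # Optionally watch cluster state while the test runs:
    crm_mon -1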