Using Pacemaker 1.1.8 on RHEL 6.4, I ran a test where I simply killed
(-KILL) corosync on a peer node, roughly as shown below.  Pacemaker seemed
to take a long time to get around to stonithing it after noticing it was AWOL.
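
This is roughly how the failure was injected (the ssh step is just for
illustration; I ran the kill directly on the peer):

  # On node1: hard-kill corosync to simulate an abrupt node failure
  ssh node1 'pkill -KILL corosync'

Here is what the surviving node (node2) then logged: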

Apr 23 19:05:20 node2 corosync[1324]:   [TOTEM ] A processor failed, forming 
new configuration.
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 188: memb=1, new=0, lost=1
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: memb: 
node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: lost: 
node1 4252674240
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 188: memb=1, new=0, lost=0
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: 
ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: update_member: Node 
4252674240/node1 is now: lost
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: 
send_member_notification: Sending membership update 188 to 2 children
Apr 23 19:05:21 node2 corosync[1324]:   [TOTEM ] A processor joined or left the 
membership and a new membership was formed.
Apr 23 19:05:21 node2 corosync[1324]:   [CPG   ] chosen downlist: sender r(0) 
ip(192.168.122.155) ; members(old:2 left:1)
Apr 23 19:05:21 node2 corosync[1324]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Apr 23 19:05:21 node2 crmd[1634]:   notice: ais_dispatch_message: Membership 
188: quorum lost
Apr 23 19:05:21 node2 crmd[1634]:   notice: crm_update_peer_state: 
crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:05:21 node2 cib[1629]:   notice: ais_dispatch_message: Membership 
188: quorum lost
Apr 23 19:05:21 node2 cib[1629]:   notice: crm_update_peer_state: 
crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:08:31 node2 crmd[1634]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
origin=crm_timer_popped ]
Apr 23 19:08:31 node2 pengine[1633]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Apr 23 19:08:31 node2 pengine[1633]:  warning: pe_fence_node: Node node1 will 
be fenced because the node is no longer part of the cluster
Apr 23 19:08:31 node2 pengine[1633]:  warning: determine_online_status: Node 
node1 is unclean
Apr 23 19:08:31 node2 pengine[1633]:  warning: custom_action: Action 
MGS_e4a31b_stop_0 on node1 is unrunnable (offline)
Apr 23 19:08:31 node2 pengine[1633]:  warning: stage6: Scheduling Node node1 
for STONITH
Apr 23 19:08:31 node2 pengine[1633]:   notice: LogActions: Move    
MGS_e4a31b#011(Started node1 -> node2)
Apr 23 19:08:31 node2 crmd[1634]:   notice: te_fence_node: Executing reboot 
fencing operation (15) on node1 (timeout=60000)
Apr 23 19:08:31 node2 stonith-ng[1630]:   notice: handle_request: Client 
crmd.1634.642b9c6e wants to fence (reboot) 'node1' with device '(any)'
Apr 23 19:08:31 node2 stonith-ng[1630]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for node1: 
fb431eb4-789c-41bc-903e-4041d50e93b4 (0)
Apr 23 19:08:31 node2 pengine[1633]:  warning: process_pe_message: Calculated 
Transition 115: /var/lib/pacemaker/pengine/pe-warn-7.bz2
Apr 23 19:08:41 node2 stonith-ng[1630]:   notice: log_operation: Operation 
'reboot' [27682] (call 0 from crmd.1634) for host 'node1' with device 
'st-node1' returned: 0 (OK)
Apr 23 19:08:41 node2 stonith-ng[1630]:   notice: remote_op_done: Operation 
reboot of node1 by node2 for crmd.1634@node2.fb431eb4: OK
Apr 23 19:08:41 node2 crmd[1634]:   notice: tengine_stonith_callback: Stonith 
operation 3/15:115:0:c118573c-84e3-48bd-8dc9-40de24438385: OK (0)
Apr 23 19:08:41 node2 crmd[1634]:   notice: tengine_stonith_notify: Peer node1 
was terminated (st_notify_fence) by node2 for node2: OK 
(ref=fb431eb4-789c-41bc-903e-4041d50e93b4) by client crmd.1634
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 192: memb=1, new=0, lost=0
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: memb: 
node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 192: memb=2, new=1, lost=0
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: update_member: Node 
4252674240/node1 is now: member
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: NEW:  
node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: 
send_member_notification: Sending membership update 192 to 2 children
Apr 23 19:09:03 node2 corosync[1324]:   [TOTEM ] A processor joined or left the 
membership and a new membership was formed.
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: update_member: 0x1efc040 
Node 4252674240 (node1) born on: 192
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: 
send_member_notification: Sending membership update 192 to 2 children

As you can see, 3 minutes and 10 seconds went by between Pacemaker noticing
the node was lost (19:05:21) and it actually scheduling the STONITH
(19:08:31), and even then the transition was kicked off by a popped timer
(cause=C_TIMER_POPPED) rather than by the membership change itself.
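
For reference, this is a quick way to pull those timestamps out of the log
(assuming the stock RHEL 6 syslog target, /var/log/messages):

  grep -E 'state is now lost|do_state_transition|te_fence_node' /var/log/messages
  # 19:05:21 crmd: crm_update_peer_state: ... node1 ... state is now lost
  # 19:08:31 crmd: do_state_transition: S_IDLE -> S_POLICY_ENGINE ... cause=C_TIMER_POPPED
  # 19:08:31 crmd: te_fence_node: Executing reboot fencing operation (15) on node1

The pengine input it calculated (pe-warn-7.bz2 above) should also be
replayable with something like
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-warn-7.bz2 if that helps.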

This smacks of some kind of misconfigured timeout, but I'm not aware of any
timeout that would have this effect.
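
For what it's worth, here's a quick way to dump the timeout-type cluster
properties I know of; whether any of them actually governs this delay is
exactly what I'm asking:

  for p in cluster-recheck-interval cluster-delay stonith-timeout dc-deadtime; do
    printf '%-26s ' "$p"
    # crm_attribute exits non-zero if the property was never set (default applies)
    crm_attribute --type crm_config --query --name "$p" 2>/dev/null || echo "(not set, default)"
  done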

Thoughts?
b.
