Hello,

On our two-node EL7 cluster (pacemaker 1.1.15-11.el7_3.4; corosync 2.4.0-4; libqb 1.0-1), it looks like successful STONITH operations are not communicated from stonith-ng back to the initiator (in this case, crmd) until the STONITHed node is removed from the cluster membership, i.e., when Corosync notices that it's gone after the token timeout.
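(For reference: by "token timeout" I mean the totem token value in corosync.conf. The snippet below is illustrative rather than our actual setting, but a value in this range would be consistent with the roughly four-minute gap in the timestamps that follow.)

totem {
    version: 2
    # membership loss of a dead node is only detected after
    # the token timeout expires (milliseconds)
    token: 240000
}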
In trace debug logs, I see the STONITH reply sent via the corosync CPG function cpg_mcast_joined() in crm_cs_flush() (stonith_send_async_reply -> send_cluster_text -> send_cpg_iov -> crm_cs_flush -> cpg_mcast_joined):

Mar 13 07:18:22 [6466] bug0 stonith-ng: ( commands.c:1891 ) trace: stonith_send_async_reply: Reply <st-reply st_origin="bug1" t="stonith-ng" st_op="st_fence" st_device_id="ustonith:0" st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0" st_clientid="823e92da-8470-491a-b662-215526cced22" st_clientname="crmd.3973" st_target="bug1" st_device_action="st_fence" st_callid="3" st_callopt="0" st_rc="0" st_output="Chassis Power Control: Reset\nChassis Power Control: Down/Off\nChassis Power Control: Down/Off\nC
Mar 13 07:18:22 [6466] bug0 stonith-ng: ( cpg.c:636 ) trace: send_cluster_text: Queueing CPG message 9 to all (1041 bytes, 449 bytes payload): <st-reply st_origin="bug1" t="stonith-ng" st_op="st_notify" st_device_id="ustonith:0" st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0" st_clientid="823e92da-8470-491a-b662-215526cced22" st_clientna
Mar 13 07:18:22 [6466] bug0 stonith-ng: ( cpg.c:207 ) trace: send_cpg_iov: Queueing CPG message 9 (1041 bytes)
Mar 13 07:18:22 [6466] bug0 stonith-ng: ( cpg.c:170 ) trace: crm_cs_flush: CPG message sent, size=1041
Mar 13 07:18:22 [6466] bug0 stonith-ng: ( cpg.c:185 ) trace: crm_cs_flush: Sent 1 CPG messages (0 remaining, last=9): OK (1)

But I see no further action from stonith-ng until minutes later; specifically, I don't see remote_op_done() run, so the dead node is still an "online (unclean)" member of the cluster and no failover can take place. When the token expires (yes, we use a very long token), I see the following:

Mar 13 07:22:11 [6466] bug0 stonith-ng: (membership.c:1018 ) notice: crm_update_peer_state_iter: Node bug1 state is now lost | nodeid=2 previous=member source=crm_update_peer_proc
Mar 13 07:22:11 [6466] bug0 stonith-ng: ( main.c:1275 ) debug: st_peer_update_callback: Broadcasting our uname because of node 2
Mar 13 07:22:11 [6466] bug0 stonith-ng: ( cpg.c:636 ) trace: send_cluster_text: Queueing CPG message 10 to all (666 bytes, 74 bytes payload): <stonith_command __name__="stonith_command" t="stonith-ng" st_op="poke"/>
...
Mar 13 07:22:11 [6466] bug0 stonith-ng: ( commands.c:2582 ) debug: stonith_command: Processing st_notify reply 0 from bug0 ( 0)
Mar 13 07:22:11 [6466] bug0 stonith-ng: ( remote.c:1945 ) debug: process_remote_stonith_exec: Marking call to poweroff for bug1 on behalf of crmd.3973@39b1f1e0-b76f-4d25-bd15-77b956c914a0.bug1: OK (0)

and the STONITH result is finally communicated back to crmd as complete, and failover commences.

Is this delay a property of cpg_mcast_joined()? If I understand correctly (unlikely), it looks like cpg_mcast_joined is not completing because one of the nodes in the group is missing, but I haven't looked at that code closely yet. Is it advisable to have stonith-ng broadcast a membership change when it successfully fences a node?

Attaching logs captured with PCMK_debug=stonith-ng and PCMK_trace_functions=stonith_send_async_reply,send_cluster_text,send_cpg_iov,crm_cs_flush

Thanks in advance,
Chris
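P.S. To make my question about cpg_mcast_joined() concrete, here is a minimal standalone sketch of how I understand the CPG API (my own toy code, not Pacemaker's; the group name and payload are made up). cpg_mcast_joined() returns CS_OK as soon as corosync accepts the message for multicast, which would match the "CPG message sent ... OK" trace above, while actual delivery, including the sender's own copy of the message, only arrives later through the deliver callback driven by cpg_dispatch():

#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* Delivery (including the sender's own copy of the message) happens
 * here, driven by cpg_dispatch(), only once totem has ordered the
 * message across the group. */
static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group_name,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
    printf("delivered %zu bytes from nodeid %u pid %u\n",
           msg_len, nodeid, pid);
}

static cpg_callbacks_t callbacks = {
    .cpg_deliver_fn = deliver_cb,
    .cpg_confchg_fn = NULL,
};

int main(void)
{
    cpg_handle_t handle;
    struct cpg_name group;
    struct iovec iov;
    const char *payload = "<st-reply .../>";  /* made-up payload */

    if (cpg_initialize(&handle, &callbacks) != CS_OK)
        return 1;

    /* made-up group name; Pacemaker daemons use their own groups */
    strcpy(group.value, "demo_group");
    group.length = strlen(group.value);
    if (cpg_join(handle, &group) != CS_OK)
        return 1;

    iov.iov_base = (void *) payload;
    iov.iov_len = strlen(payload) + 1;

    /* Returns CS_OK as soon as corosync accepts the message for
     * multicast (the point crm_cs_flush logs as "CPG message sent"),
     * not when the group has actually received it. */
    if (cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1) != CS_OK)
        return 1;

    /* Without (or before) this dispatch loop, nothing is "received",
     * not even locally on the sending node. */
    cpg_dispatch(handle, CS_DISPATCH_BLOCKING);
    return 0;
}

(Compiles with "gcc cpg_demo.c -lcpg" on a box with corosynclib-devel installed.) If that reading is right, the trace above only shows corosync accepting the reply, not stonith-ng or crmd ever receiving it back.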