Hi!

I have a two-node cluster (virtual machines) with several resources and shared storage. When connectivity is lost (for a reason that still needs to be debugged), here is what I get (I am skipping unrelated messages):

May 14 16:49:21 wcs2 corosync[27531]: [TOTEM ] The token was lost in the OPERATIONAL state.
May 14 16:49:21 wcs2 corosync[27531]: [TOTEM ] A processor failed, forming new configuration.

Why is corosync connectivity lost? There was nothing suspicious in the logs at all.
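One thing I can think of (an assumption on my part, not confirmed by the logs): on virtual machines, host scheduling pauses can exceed corosync's token timeout (1000 ms by default in corosync 2.x), which produces exactly this "token was lost" message without anything else looking wrong. A sketch of how one might check and raise it:

```shell
# Dump the runtime CMAP keys and look at the effective token timeout (ms).
corosync-cmapctl | grep totem.token

# If it is at the default, a larger value can be set in
# /etc/corosync/corosync.conf on both nodes, e.g.:
#
#   totem {
#       token: 5000
#   }
#
# and corosync restarted (or, on recent versions, reloaded).
```

This does not explain the root cause of the connectivity loss, of course, only whether the cluster is unusually sensitive to short stalls.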

May 14 16:49:36 wcs2 corosync[27531]: [VOTEQ ] node 739269211 state=2, votes=1, expected=2
May 14 16:49:36 wcs2 corosync[27531]: [VOTEQ ] node 739269212 state=1, votes=1, expected=2
May 14 16:49:36 wcs2 corosync[27531]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 14 16:49:36 wcs2 corosync[27531]: [QUORUM] Members[1]: 739269212
May 14 16:49:36 wcs2 corosync[27531]: [QUORUM] sending quorum notification to (nil), length = 52
May 14 16:49:36 wcs2 crmd[11381]: warning: match_down_event: No match for shutdown action on 739269211
May 14 16:49:36 wcs2 crmd[11381]: notice: peer_update_callback: Stonith/shutdown of wcs1 not matched

What does that warning mean?

May 14 16:49:37 wcs2 pengine[27574]: notice: unpack_config: On loss of CCM Quorum: Ignore
May 14 16:49:37 wcs2 pengine[27574]: warning: pe_fence_node: Node wcs1 will be fenced because stonith_sbd is thought to be active there
May 14 16:49:37 wcs2 pengine[27574]: warning: custom_action: Action stonith_sbd_stop_0 on wcs1 is unrunnable (offline)
May 14 16:49:37 wcs2 pengine[27574]: warning: stage6: Scheduling Node wcs1 for STONITH
May 14 16:49:37 wcs2 pengine[27574]: notice: LogActions: Move stonith_sbd#011(Started wcs1 -> wcs2)

All resources were active on node wcs2 (the survivor); stonith_sbd was active on node wcs1.

May 14 16:49:37 wcs2 crmd[11381]: notice: te_fence_node: Executing reboot fencing operation (38) on wcs1 (timeout=60000)
May 14 16:49:37 wcs2 stonith-ng[27571]: notice: handle_request: Client crmd.11381.a02439c4 wants to fence (reboot) 'wcs1' with device '(any)'
May 14 16:49:37 wcs2 stonith-ng[27571]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for wcs1: 37151815-2182-42fa-b32e-86288b18085b (0)

Now, as these are actually virtual machines, the reboot happens quite quickly:

May 14 16:49:46 wcs2 crmd[11381]: notice: pcmk_quorum_notification: Membership 1000: quorum acquired (2)
May 14 16:49:46 wcs2 crmd[11381]: notice: crm_update_peer_state: pcmk_quorum_notification: Node wcs1[739269211] - state is now member
May 14 16:50:05 wcs2 crmd[11381]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
May 14 16:50:07 wcs2 attrd[27573]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
May 14 16:50:07 wcs2 attrd[27573]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
May 14 16:50:49 wcs2 stonith-ng[27571]: error: remote_op_done: Operation reboot of wcs1 by wcs2 for crmd.11381@wcs2.37151815: Timer expired
May 14 16:50:49 wcs2 crmd[11381]: notice: tengine_stonith_callback: Stonith operation 11/38:2655:0:8f1636b7-dd1d-470c-b645-65a9c8743a69: Timer expired (-62)
May 14 16:50:49 wcs2 crmd[11381]: notice: tengine_stonith_callback: Stonith operation 11 for wcs1 failed (Timer expired): aborting transition.
May 14 16:50:49 wcs2 crmd[11381]: notice: tengine_stonith_notify: Peer wcs1 was not terminated (st_notify_fence) by wcs2 for wcs2: Timer expired (ref=37151815-2182-42fa-b32e-86288b18085b) by client crmd.11381

But why does the reboot operation's timer expire?
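One possibility I want to rule out (again an assumption, not something the logs confirm): with sbd, the fence agent writes a "poison pill" to the shared disk and then waits out sbd's msgwait timeout before reporting success, so Pacemaker's stonith timeout (60000 ms above) has to comfortably exceed msgwait; if it doesn't, or if the target's watchdog never fires, the operation times out just like this. A sketch of how one might compare the two values (the device path is a placeholder, not from my setup):

```shell
# Dump the sbd header from the shared disk; look at
# "Timeout (watchdog)" and "Timeout (msgwait)" in the output.
sbd -d /dev/<shared-disk> dump

# Query the cluster-wide stonith timeout that Pacemaker uses
# (falls back to its default if the property is unset).
crm_attribute --type crm_config --name stonith-timeout --query
```

If msgwait turns out to be close to or above the stonith timeout, that alone would explain the "Timer expired" result even though the node actually did reboot.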




_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
