Sounds similar to the issue I described here last week.  We also had two nodes, 
and lost the network connection between them while one was starting up after a 
fence.  Although we had stonith resources configured, those resources were 
never called, and the cluster was considered active on both nodes throughout 
the network split.  We were able to reproduce this issue in our lab; there 
seems to be a window during corosync startup where, if a node joins the 
cluster and then leaves before Pacemaker's stonith resources have started, it 
will not be fenced.  This issue may be isolated to two-node systems, since 
normally a single node that is separated from the cluster will have lost 
quorum, which is not the case with two_node.

Are you running with "two_node" in corosync.conf?
Are you running with "wait_for_all"? (It's on by default with "two_node")
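For reference, both options live in the quorum section of corosync.conf
(corosync 2.x votequorum syntax; the values shown are illustrative, not taken
from anyone's actual config):

```
# Sketch of a two-node votequorum configuration (corosync 2.x).
quorum {
    provider: corosync_votequorum
    two_node: 1
    # wait_for_all is enabled automatically when two_node is set,
    # but it can also be stated explicitly:
    wait_for_all: 1
}
```

With two_node set, each node retains quorum even when it is the only member,
which is why a network split alone does not stop resources on either side;
wait_for_all only delays quorum until both nodes have been seen at least once
after startup.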

________________________________
From: Chris Walker [christopher.wal...@gmail.com]
Sent: Sunday, August 02, 2015 23:02
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Node lost early in HA startup --> no STONITH

Hello,

We recently had an unfortunate sequence on our two-node cluster (nodes n02 and 
n03) that can be summarized as:
1.  n03 became pathologically busy and was STONITHed by n02
2.  The heavy load migrated to n02, which also became pathologically busy
3.  n03 was rebooted
4.  During the startup of HA on n03, n02 was initially seen by n03:

Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is now 
online

5.  But later during the startup sequence (after DC election and CIB sync) we 
see n02 die (n02 is really wrapped around the axle, many stuck threads, etc)

Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
...
Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02 is 
now lost (was member)

Our deadtime is 240 seconds, so n02 must have become unresponsive almost 
immediately after n03 reported it up at 15:23:43.

6.  The troubling aspect of this incident is that even though there are 
multiple STONITH resources configured for n03, none of them was engaged, and 
n03 then mounted filesystems that were still active on n02.

I'm wondering whether the fact that no STONITH resources were started by this 
time explains why n02 was not STONITHed.  Shortly after n02 is declared dead we 
see STONITH resources begin starting, e.g.,

Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start   
n03-3-ipmi-stonith (n03)

Is it the case that, because there were no active STONITH resources when n02 
was declared dead, no STONITH action was taken against it?  Is there a 
fix or workaround for this scenario?  (We're using Heartbeat 3.0.5 and 
Pacemaker 1.1.6 on RHEL 6.2.)
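For context, a stonith resource like the one named in the pengine log above 
might be defined via the crm shell roughly as follows.  This is a sketch only: 
the agent choice, device address, credentials, and the assumption that the 
resource fences the peer node are all hypothetical, not taken from the 
poster's configuration:

```
# Hypothetical crm-shell definition of an IPMI-based stonith resource
# that runs on n03 and fences n02 (all parameters are assumptions).
primitive n03-3-ipmi-stonith stonith:external/ipmi \
    params hostname="n02" ipaddr="192.168.1.2" \
           userid="admin" passwd="secret" interface="lan" \
    op monitor interval="60s"
# Keep the fencing resource off the node it is meant to fence.
location l-n03-3-ipmi-stonith n03-3-ipmi-stonith -inf: n02
```

Note that a stonith resource only carries out fencing once it has been started 
by the cluster, which is exactly the window at issue in this thread: n02 was 
declared dead before any such resource was active on n03.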

Thanks very much!
Chris
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org