From what I see, he is using heartbeat.

2015-08-03 17:14 GMT+02:00 Thomas Meagher <thomas.meag...@hds.com>:
>
> Sounds similar to the issue I described here last week. We also had two
> nodes, and we lost the network connection between them while one node was
> starting up after a fence. Although we had stonith resources configured,
> those resources were never called, and the cluster was considered active
> on both nodes throughout the network split. We were able to reproduce this
> issue in our lab; there seems to be a window during corosync startup in
> which a node that joins the cluster and then leaves again before the
> Pacemaker stonith resources have started will not be fenced. This issue
> may be isolated to two-node systems, since normally a single node that is
> separated from the cluster will have lost quorum, which is not the case
> with two_node.
>
> Are you running with "two_node" in corosync.conf?
> Are you running with "wait_for_all"? (It's on by default with "two_node".)
>
> ________________________________
> From: Chris Walker [christopher.wal...@gmail.com]
> Sent: Sunday, August 02, 2015 23:02
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] Node lost early in HA startup --> no STONITH
>
> Hello,
>
> We recently had an unfortunate sequence on our two-node cluster (nodes n02
> and n03) that can be summarized as:
>
> 1. n03 became pathologically busy and was STONITHed by n02.
>
> 2. The heavy load migrated to n02, which also became pathologically busy.
>
> 3. n03 was rebooted.
>
> 4. During the startup of HA on n03, n02 was initially seen by n03:
>
> Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is
> now online
>
> 5. But later in the startup sequence (after DC election and CIB sync) we
> see n02 die (n02 is really wrapped around the axle: many stuck threads,
> etc.):
>
> Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
> ...
> Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02
> is now lost (was member)
>
> Our deadtime is 240 seconds, so n02 must have become unresponsive almost
> immediately after n03 reported it up at 15:23:43.
>
> 6. The troubling aspect of this incident is that even though there are
> multiple STONITH resources configured for n03, none of them was engaged,
> and n03 then mounted filesystems that were also active on n02.
>
> I'm wondering whether the fact that no STONITH resources had started by
> this point explains why n02 was not STONITHed. Shortly after n02 is
> declared dead, we see STONITH resources begin starting, e.g.:
>
> Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start
> n03-3-ipmi-stonith (n03)
>
> Is it the case that, because there were no active STONITH resources when
> n02 was declared dead, no STONITH action was taken against that node? Is
> there a fix/workaround for this scenario (we're using heartbeat 3.0.5 and
> pacemaker 1.1.6 on RHEL 6.2)?
>
> Thanks very much!
> Chris
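
For readers following along: the "two_node" and "wait_for_all" options Thomas
asks about live in the quorum section of corosync.conf (corosync 2.x
votequorum). A minimal sketch, assuming a corosync 2.x stack rather than the
heartbeat stack Chris is actually running; the option names are real
votequorum options, the values are illustrative:

    quorum {
        provider: corosync_votequorum
        # two_node: treat the 2-node cluster as quorate even when only one
        # node is up (quorum would otherwise require 2 of 2 votes).
        two_node: 1
        # wait_for_all is enabled automatically when two_node is set; a node
        # booting alone waits until it has seen its peer at least once
        # before it may claim quorum.
        wait_for_all: 1
    }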
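
The 240-second deadtime Chris mentions is a heartbeat setting from
/etc/ha.d/ha.cf. A sketch of the relevant directives, with deadtime taken
from the thread and the other values purely illustrative (Chris's actual
file is not shown):

    # /etc/ha.d/ha.cf (heartbeat) -- illustrative values only
    keepalive 2        # seconds between heartbeat packets
    warntime 60        # warn after 60s without a heartbeat from a peer
    deadtime 240       # declare a peer dead after 240s of silence (matches
                       # the 15:23:43 -> 15:27:44 gap in the logs above)
    initdead 480       # allow extra time at boot before declaring peers dead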
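
For context on the STONITH resources being discussed, here is a hedged
sketch of how an IPMI-based fencing resource is commonly defined with the
crm shell on a pacemaker 1.1 / heartbeat stack. The resource name, BMC
address, and credentials below are placeholders, not taken from Chris's
configuration:

    primitive n02-ipmi-stonith stonith:external/ipmi \
        params hostname=n02 ipaddr=10.0.0.2 userid=admin passwd=secret \
               interface=lan \
        op monitor interval=60s
    # Keep the device that fences n02 off n02 itself, so the surviving
    # node is the one that runs it.
    location l-n02-ipmi-stonith n02-ipmi-stonith -inf: n02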
-- 
.~.
/V\
// \\
/( )\
^`~'^

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org