I saw Thomas's post from last week and it sounded very similar to what we saw, but I wasn't sure whether the heartbeat/corosync difference made this a different issue. I'm trying to reproduce the problem and gather the log/config info.

Thanks again,
Chris

On Mon, Aug 3, 2015 at 11:27 AM, emmanuel segura <emi2f...@gmail.com> wrote:

> From what I see, he is using heartbeat.
>
> 2015-08-03 17:14 GMT+02:00 Thomas Meagher <thomas.meag...@hds.com>:
>
> > Sounds similar to the issue I described here last week. We also had two
> > nodes, and lost network connection between the two nodes while one was
> > starting up after a fence. Although we had stonith resources configured,
> > those resources were never called, and the cluster was considered active
> > on both nodes throughout the network split. We were able to reproduce
> > this issue in our lab. It seems there is a window during corosync startup
> > where, if a node joins the cluster and then leaves before Pacemaker
> > stonith resources have started, it will not be fenced. This issue may be
> > isolated to two-node systems, as normally a single node that is separated
> > from the cluster will have lost quorum, which is not the case with
> > two_node.
> >
> > Are you running with "two_node" in corosync.conf?
> > Are you running with "wait_for_all"? (It's on by default with "two_node")
> >
> > ________________________________
> > From: Chris Walker [christopher.wal...@gmail.com]
> > Sent: Sunday, August 02, 2015 23:02
> > To: pacemaker@oss.clusterlabs.org
> > Subject: [Pacemaker] Node lost early in HA startup --> no STONITH
> >
> > Hello,
> >
> > We recently had an unfortunate sequence on our two-node cluster (nodes
> > n02 and n03) that can be summarized as:
> >
> > 1. n03 became pathologically busy and was STONITHed by n02
> > 2. The heavy load migrated to n02, which also became pathologically busy
> > 3. n03 was rebooted
> > 4. During the startup of HA on n03, n02 was initially seen by n03:
> >
> > Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is now online
> >
> > 5. But later during the startup sequence (after DC election and CIB
> > sync) we see n02 die (n02 is really wrapped around the axle, many stuck
> > threads, etc.)
> >
> > Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
> > ...
> > Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02 is now lost (was member)
> >
> > Our deadtime is 240 seconds, so n02 became unresponsive almost
> > immediately after n03 reported it up at 15:23:43.
> >
> > 6. The troubling aspect of this incident is that even though there are
> > multiple STONITH resources configured for n03, none of them was engaged
> > and n03 then mounted filesystems that were also active on n02.
> >
> > I'm wondering whether the fact that no STONITH resources were started by
> > this time explains why n02 was not STONITHed. Shortly after n02 is
> > declared dead we see STONITH resources begin starting, e.g.,
> >
> > Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start n03-3-ipmi-stonith (n03)
> >
> > Does the fact that there were no active STONITH resources when n02 was
> > declared dead mean that no STONITH action was taken against this node?
> > Is there a fix/workaround for this scenario (we're using heartbeat 3.0.5
> > and pacemaker 1.1.6 (RHEL6.2))?
> >
> > Thanks very much!
> > Chris
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
> --
>   .~.
>   /V\
>  // \\
> /(   )\
> ^`~'^
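
For reference, the timeline in the original report can be checked with a few lines of arithmetic. This is only a sketch built from the three log timestamps and the 240-second deadtime quoted above. It assumes that heartbeat declares a node dead one full deadtime after the last heartbeat it received, and it takes the year from the thread date because the log lines do not carry one.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log excerpts in the report.
# The year is an assumption taken from the thread date.
seen_online = datetime(2015, 7, 26, 15, 23, 43)    # crmd: n02.ais is now online
declared_dead = datetime(2015, 7, 26, 15, 27, 44)  # heartbeat: node n02: is dead
stonith_start = datetime(2015, 7, 26, 15, 27, 47)  # pengine: Start n03-3-ipmi-stonith

DEADTIME = timedelta(seconds=240)  # heartbeat deadtime stated in the report

# Assumption: a node is declared dead one deadtime after its last heartbeat.
last_heartbeat = declared_dead - DEADTIME
responsive_for = last_heartbeat - seen_online
fencing_gap = stonith_start - declared_dead

print(f"estimated last heartbeat from n02: {last_heartbeat.time()}")
print(f"n02 responsive after being seen online: {responsive_for.total_seconds():.0f} s")
print(f"STONITH resource start logged after 'dead': {fencing_gap.total_seconds():.0f} s")
```

Under that assumption, n02 stopped responding about one second after n03 first saw it, and the first STONITH resource start was logged about three seconds after n02 was declared dead, which is consistent with the suspicion that no fencing device was running yet at that point.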