On 02/08/15 11:10 PM, Chris Walker wrote:
> Hello,
>
> We recently had an unfortunate sequence on our two-node cluster (nodes
> n02 and n03) that can be summarized as:
>
> 1. n03 became pathologically busy and was STONITHed by n02.
> 2. The heavy load migrated to n02, which also became pathologically busy.
> 3. n03 was rebooted.
> 4. During the startup of HA on n03, n02 was initially seen by n03:
>
> Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is now online
>
> 5. Later during the startup sequence (after DC election and CIB sync),
> we see n02 die (n02 is really wrapped around the axle: many stuck
> threads, etc.):
>
> Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
> ...
> Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02 is now lost (was member)
>
> Our deadtime is 240 seconds, so n02 must have become unresponsive
> almost immediately after n03 reported it up at 15:23:43.
>
> 6. The troubling aspect of this incident is that even though multiple
> STONITH resources are configured for n03, none of them was engaged,
> and n03 then mounted filesystems that were also active on n02.
>
> I'm wondering whether the fact that no STONITH resources had been
> started by this time explains why n02 was not STONITHed. Shortly after
> n02 is declared dead, we see STONITH resources begin starting, e.g.:
>
> Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start n03-3-ipmi-stonith (n03)
>
> Is it the case that, because there were no active STONITH resources
> when n02 was declared dead, no STONITH action was taken against it? Is
> there a fix or workaround for this scenario? We're using heartbeat
> 3.0.5 and pacemaker 1.1.6 (RHEL 6.2).
>
> Thanks very much!
> Chris
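For readers hitting a similar situation: the usual knobs to review are the cluster-wide `stonith-enabled` and `startup-fencing` properties (the latter controls whether a node whose state is unknown at startup is fenced rather than assumed clean), plus making sure each node carries a fencing device that targets its *peer*. A minimal sketch in crm shell syntax follows; the resource name, IP address, and credentials are hypothetical placeholders, and this is illustrative, not a verified fix for the incident described above.

```shell
# Hedged sketch (crm shell, pacemaker 1.1-era). Names/addresses are
# placeholders -- adapt to your own IPMI BMCs and credentials.

# Ensure fencing is globally enabled, and that nodes with unknown
# state at cluster startup are fenced rather than presumed clean.
crm configure property stonith-enabled=true
crm configure property startup-fencing=true

# An external/ipmi device that fences n02, kept off n02 itself with a
# -inf location constraint so the surviving node can always run it.
crm configure primitive fence-n02 stonith:external/ipmi \
    params hostname=n02 ipaddr=10.0.0.2 userid=admin \
           passwd=secret interface=lan
crm configure location fence-n02-not-on-n02 fence-n02 -inf: n02
```

A mirrored `fence-n03` primitive constrained away from n03 would complete the pair; whether a not-yet-started STONITH resource can service a fencing request during startup is exactly the question raised above, so test this against your own stack before relying on it.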
Please share your full config and the logs from both nodes through the duration of the events.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org