Hello,

We recently had an unfortunate sequence on our two-node cluster (nodes n02 and n03) that can be summarized as:

1. n03 became pathologically busy and was STONITHed by n02.
2. The heavy load migrated to n02, which also became pathologically busy.
3. n03 was rebooted.
4. During the startup of HA on n03, n02 was initially seen by n03:

   Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is now online

5. But later during the startup sequence (after DC election and CIB sync) we see n02 die (n02 was really wrapped around the axle, with many stuck threads, etc.):

   Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
   ...
   Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02 is now lost (was member)

   Our deadtime is 240 seconds, so n02 became unresponsive almost immediately after n03 reported it up at 15:23:43.

6. The troubling aspect of this incident is that even though there are multiple STONITH resources configured for n03, none of them was engaged, and n03 then mounted filesystems that were also active on n02.

I'm wondering whether the fact that no STONITH resources had been started by this time explains why n02 was not STONITHed. Shortly after n02 is declared dead we see STONITH resources begin starting, e.g.:

   Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start n03-3-ipmi-stonith (n03)

Is it the case that, because there were no active STONITH resources when n02 was declared dead, no STONITH action was taken against that node? Is there a fix/workaround for this scenario (we're using heartbeat 3.0.5 and pacemaker 3.1.6 (RHEL6.2))?

Thanks very much!

Chris
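
For reference, the timing claim in item 5 can be checked with a short Python sketch. It only uses the two log timestamps and the 240-second deadtime quoted above. The variable names are illustrative, and it assumes heartbeat declares a node dead once deadtime has elapsed since the last heartbeat it received.

```python
from datetime import datetime, timedelta

# Deadtime as stated in the message (240 seconds).
DEADTIME = timedelta(seconds=240)

def parse_time(hms):
    # Both log entries are from the same day (Jul 26), so the time of day is enough.
    return datetime.strptime(hms, "%H:%M:%S")

seen_online = parse_time("15:23:43")    # crmd: n02.ais is now online
declared_dead = parse_time("15:27:44")  # heartbeat: node n02: is dead

# Time between n02 being reported online and being declared dead.
elapsed = declared_dead - seen_online

# Assumption: the last heartbeat from n02 arrived about one deadtime
# before the "is dead" message.
last_heartbeat = declared_dead - DEADTIME
gap_after_online = last_heartbeat - seen_online

print(elapsed.total_seconds())           # 241.0
print(gap_after_online.total_seconds())  # 1.0
```

Under that assumption, n02 went silent roughly one second after n03 logged it as online, which is what "almost immediately" refers to.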
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org