On 02/08/15 11:10 PM, Chris Walker wrote:
> Hello,
>
> We recently had an unfortunate sequence on our two-node cluster (nodes
> n02 and n03) that can be summarized as:
>
> 1. n03 became pathologically busy and was STONITHed by n02.
> 2. The heavy load migrated to n02, which also became pathologically busy.
> 3. n03 was rebooted.
> 4. During the startup of HA on n03, n02 was initially seen by n03:
>
> Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is now online
>
> 5. Later during the startup sequence (after DC election and CIB sync),
> we see n02 die (n02 is really wrapped around the axle: many stuck
> threads, etc.):
>
> Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
> ...
> Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02 is now lost (was member)
>
> Our deadtime is 240 seconds, so n02 must have become unresponsive
> almost immediately after n03 reported it up at 15:23:43.
>
> 6. The troubling aspect of this incident is that even though multiple
> STONITH resources are configured for n03, none of them was engaged,
> and n03 then mounted filesystems that were also active on n02.
>
> I'm wondering whether the fact that no STONITH resources had been
> started by this time explains why n02 was not STONITHed. Shortly after
> n02 is declared dead, we see STONITH resources begin starting, e.g.:
>
> Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start n03-3-ipmi-stonith (n03)
>
> Is it the case that, because there were no active STONITH resources
> when n02 was declared dead, no STONITH action was taken against it? Is
> there a fix or workaround for this scenario? We're using heartbeat
> 3.0.5 and pacemaker 1.1.6 (RHEL 6.2).
>
> Thanks very much!
> Chris
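For readers hitting a similar situation: the usual knobs to review are the cluster-wide `stonith-enabled` and `startup-fencing` properties (the latter controls whether a node whose state is unknown at startup is fenced rather than assumed clean), plus making sure each node carries a fencing device that targets its *peer*. A minimal sketch in crm shell syntax follows; the resource name, IP address, and credentials are hypothetical placeholders, and this is illustrative, not a verified fix for the incident described above.

```shell
# Hedged sketch (crm shell, pacemaker 1.1-era). Names/addresses are
# placeholders -- adapt to your own IPMI BMCs and credentials.

# Ensure fencing is globally enabled, and that nodes with unknown
# state at cluster startup are fenced rather than presumed clean.
crm configure property stonith-enabled=true
crm configure property startup-fencing=true

# An external/ipmi device that fences n02, kept off n02 itself with a
# -inf location constraint so the surviving node can always run it.
crm configure primitive fence-n02 stonith:external/ipmi \
    params hostname=n02 ipaddr=10.0.0.2 userid=admin \
           passwd=secret interface=lan
crm configure location fence-n02-not-on-n02 fence-n02 -inf: n02
```

A mirrored `fence-n03` primitive constrained away from n03 would complete the pair; whether a not-yet-started STONITH resource can service a fencing request during startup is exactly the question raised above, so test this against your own stack before relying on it.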
Please share your full config and the logs from both nodes through the duration of the events.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org