Re: [Pacemaker] Best way to recover from failed STONITH?

Andrew Martin Fri, 21 Dec 2012 10:51:53 -0800

Andreas,

Thanks for the help. Please see my replies inline below.


----- Original Message -----
> From: "Andreas Kurz" <andr...@hastexo.com>
> To: pacemaker@oss.clusterlabs.org
> Sent: Friday, December 21, 2012 10:11:08 AM
> Subject: Re: [Pacemaker] Best way to recover from failed STONITH?
> 
> On 12/21/2012 04:18 PM, Andrew Martin wrote:
> > Hello,
> > 
> > Yesterday a power failure took out one of the nodes and its STONITH
> > device (they share an upstream power source) in a 3-node
> > active/passive cluster (Corosync 2.1.0, Pacemaker 1.1.8). After
> > logging into the cluster, I saw that the STONITH operation had
> > given up in failure and that none of the resources were running on
> > the other nodes:
> > Dec 20 17:59:14 [18909] quorumnode crmd: notice: too_many_st_failures: Too many failures to fence node0 (11), giving up
> > 
> > I brought the failed node back online and it rejoined the cluster,
> > but no more STONITH attempts were made and the resources remained
> > stopped. Eventually I set stonith-enabled="false", ran killall on
> > all pacemaker-related processes on the other (remaining) nodes,
> > then restarted pacemaker, and the resources successfully migrated
> > to one of the other nodes. This seems like a rather invasive
> > technique. My questions about this type of situation are:
> >  - is there a better way to tell the cluster "I have manually
> >  confirmed this node is dead/safe"? I see there is the meatclient
> >  command, but can that only be used with the meatware STONITH
> >  plugin?
> 
> crm node cleanup quorumnode

I'm using the latest version of crmsh (1.2.1), but it doesn't seem to
support this command:
root@node0:~# crm --version
1.2.1 (Build unknown)
root@node0:~# crm node
crm(live)node# help

Node management and status commands.

Available commands:

        status           show nodes' status as XML
        show             show node
        standby          put node into standby
        online           set node online
        fence            fence node
        clearstate       Clear node state
        delete           delete node
        attribute        manage attributes
        utilization      manage utilization attributes
        status-attr      manage status attributes
        help             show help (help topics for list of topics)
        end              go back one level
        quit             exit the program

Also, do I run cleanup on just the node that failed, or all of them?
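
For reference, a minimal sketch of how a manually verified node might be
acknowledged, assuming the "clearstate" subcommand from the help output
above and the --confirm option of stonith_admin. The helper names run and
confirm_node_down are made up for illustration, and node0 is the failed
node from the log line earlier in this thread. The script only prints the
commands unless DRY_RUN=0 is set, and crmsh may ask for confirmation
before clearing the state.

```shell
#!/bin/sh
# Sketch only: acknowledge a node that a human has verified as powered off.
# DRY_RUN defaults to 1, so the commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# confirm_node_down NODE  (hypothetical helper, not part of Pacemaker)
confirm_node_down() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: confirm_node_down NODE" >&2
        return 1
    fi
    # Tell the fencing subsystem that the node is known to be safely down.
    run stonith_admin --confirm "$node"
    # Clear the stale node state in the CIB (crmsh "clearstate").
    run crm node clearstate "$node"
}

# Example: prints the two commands for the failed node.
#   confirm_node_down node0
# To actually run them:
#   DRY_RUN=0 confirm_node_down node0
```

Only confirm a node that is really off; acknowledging a node that is
still running defeats the purpose of fencing.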


> 
> >  - in general, is there a way to force the cluster to start
> >  resources, if you just need to get them back online and as a
> >  human have confirmed that things are okay? Something like crm
> >  resource start rsc --force?
> 
> ... see above ;-)

On a related note, is there a way to get better information
about why the cluster is in its current state? For example, in this 
situation it would be nice to be able to run a command and have the
cluster print "resources stopped until node XXX can be fenced" to
be able to quickly assess the problem with the cluster.
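
As a rough sketch of the kind of check meant here: the read-only
commands below are the closest views of cluster state I know of. The
wrapper name cluster_report is made up for illustration. None of them
is guaranteed to print a plain sentence such as the one above, so the
output still has to be interpreted by hand.

```shell
#!/bin/sh
# Sketch only: gather read-only views of the current cluster state.
# cluster_report is a hypothetical wrapper, not a Pacemaker command.

cluster_report() {
    echo "== one-shot status with fail counts =="
    crm_mon -1 -f

    echo "== allocation scores computed from the live CIB =="
    crm_simulate -sL

    echo "== configuration check against the live CIB =="
    crm_verify -L -V
}

# Example:
#   cluster_report > /tmp/cluster-report.txt
```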

> 
> >  - how can I completely clear out saved data for the cluster and
> >  start over from scratch (last-resort option)? Stopping pacemaker
> >  and removing everything from /var/lib/pacemaker/cib and
> >  /var/lib/pacemaker/pengine cleans the CIB, but the nodes end up
> >  sitting in the "pending" state for a very long time (30 minutes
> >  or more). Am I missing another directory that needs to be
> >  cleared?
> 
> you started with a completely empty cib and the two (or three?) nodes
> needed 30min to form a cluster?
Yes, in fact I cleared out both /var/lib/pacemaker/cib and
/var/lib/pacemaker/pengine several times, and most of the time after
starting pacemaker again one node would become "online" pretty quickly
(less than 5 minutes), but the other two would remain "pending" for
quite some time. I left it going overnight and this morning all of the nodes
> 
> > 
> > I am going to look into making the power source for the STONITH
> > device independent of the power source for the node itself,
> > however even with that setup there's still a chance that something
> > could take out both power sources at the same time, in which case
> > manual intervention and confirmation that the node is dead would
> > be required.
> 
> Pacemaker 1.1.8 supports (again) stonith topologies ... so more than one
> fencing device and they can be "logically" combined.

Where can I find documentation on STONITH topologies and configuring
more than one fencing device for a single node? I don't see it mentioned
in the Cluster Labs documentation (Clusters from Scratch or Pacemaker 
Explained).
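
To make the question concrete, here is a sketch of what such a
configuration might look like, assuming the crmsh "fencing_topology"
command is available (I have not confirmed that crmsh 1.2.1 includes
it). The device ids ipmi-node0 and pdu-node0 and the helper name
configure_topology are hypothetical; the devices would have to exist
already as stonith resources.

```shell
#!/bin/sh
# Sketch only: define ordered fencing levels for one node via crmsh.
# configure_topology is a hypothetical helper; the device ids passed to
# it must already be configured as stonith resources.

# configure_topology NODE DEVICE [DEVICE ...]
configure_topology() {
    node="$1"
    shift
    # Remaining arguments are stonith resource ids, tried in the order
    # given: the first is level 1, the second is level 2, and so on.
    crm configure fencing_topology "${node}:" "$@"
}

# Example: try the IPMI device first, then fall back to the PDU.
#   configure_topology node0 ipmi-node0 pdu-node0
#
# In the CIB this corresponds to a fencing-topology section such as:
#   <fencing-topology>
#     <fencing-level id="fl-node0-1" target="node0" index="1" devices="ipmi-node0"/>
#     <fencing-level id="fl-node0-2" target="node0" index="2" devices="pdu-node0"/>
#   </fencing-topology>
```

With levels like these, the second device is only tried if the first
one fails, which is the case where the node and its primary STONITH
device lose power together.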

Thanks,

Andrew

> 
> Regards,
> Andreas
> 
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
> 
> > 
> > Thanks,
> > 
> > Andrew
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> > 

