On 09/09/2016 08:52 AM, Auer, Jens wrote:
> Hi,
>
> a client asked me to describe the conditions when Pacemaker uses STONITH
> to bring the cluster into a known state. The documentation says that
> this happens when "we cannot establish with certainty a state of some
> node or resource", but I need some more concrete explanations.
> Specifically, he is wondering what happens when
> 1. a resource, e.g. a virtual IP, fails too often
> 2. the heartbeats of one of the cluster nodes are not received anymore
> 3. combinations of these two
>
> Is there a better definition of the conditions that trigger STONITH?
To state the obvious, just for completeness: before STONITH/fencing can be
used at all, stonith-enabled must be true, working fence devices must be
configured, and each node must be targetable by at least one fence device.
If fence device failure is a concern, a fencing topology with multiple
devices should be used. The fencing setup should be verified by testing
before going into production.

Assuming that's in place, fencing can be used in the following situations:

* Most importantly, if corosync communication is broken between nodes,
  fencing will be attempted. If no-quorum-policy=ignore, each partition
  will attempt to fence the other (in two-node clusters, a fence delay is
  commonly configured on one node to avoid a death match here); otherwise,
  the partition with quorum will try to fence the partition without
  quorum. This can happen due to one or more nodes crashing, being under
  extreme load, losing network connectivity, etc. Options in corosync.conf
  affect how long it takes to detect such an outage.

* If no-quorum-policy=suicide, and one or more nodes are separated from
  the rest of the cluster such that they lose quorum, they will fence
  themselves.

* If startup-fencing=true (the default), and some nodes are not present
  when the cluster first starts, those nodes will be fenced.

* If a resource operation has on-fail=fence and it fails, the cluster
  will fence the node where the failure occurred. Note that on-fail
  defaults to fence for stop operations, since if we can't stop a
  resource, we can't recover it elsewhere.

* If someone/something explicitly requests fencing via the stonithd API
  (for example, "stonith_admin -F <node>"), then of course the node will
  be fenced. Some software, such as DRBD and DLM, can be configured to
  use Pacemaker's fencing, so fencing might be triggered by them under
  their own conditions.
* In a multi-site cluster using booth, if a ticket constraint has
  loss-policy=fence and the ticket is lost, the cluster will fence the
  nodes that were running the resources associated with the ticket.

I may be forgetting some, but those are the most important.

How the cluster responds to resource-level failures (as opposed to losing
an entire node) depends on the configuration, but unless you've configured
on-fail=fence, fencing won't be involved. See the documentation for the
migration-threshold and on-fail parameters.
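For reference, the basic knobs mentioned above might be set up along these
lines with pcs. This is only a sketch: the fence agent (fence_ipmilan),
its addresses and credentials, and the node/resource names are
placeholders, not a recommendation for any particular hardware.

```shell
# Enable fencing cluster-wide (fencing is useless without a working device).
pcs property set stonith-enabled=true

# Register a fence device that can target node1. Agent and parameters are
# placeholders -- use whatever matches your hardware. The delay gives node1
# a head start in a two-node death match.
pcs stonith create fence-node1 fence_ipmilan \
    ip=10.0.0.1 username=admin password=secret \
    pcmk_host_list=node1 delay=5

# Leave no-quorum-policy at its default (stop) unless you know why you
# need ignore or suicide; see the partition behavior described above.
pcs property set no-quorum-policy=stop

# With on-fail=fence, a failed monitor on this IP fences the node that
# had the failure, rather than just recovering the resource.
pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.1.10 \
    op monitor interval=30s on-fail=fence
```

For testing, fencing can be requested manually with "stonith_admin -F
node1" (or "pcs stonith fence node1"), which exercises the explicit-request
case from the list above.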
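The booth case maps to a ticket constraint in the configuration. With pcs
it would look roughly like this (the ticket name and resource name are
invented, and ticket-constraint support assumes a reasonably recent pcs):

```shell
# If ticketA is lost (revoked or expired), loss-policy=fence fences the
# nodes that were running group1, as described in the booth bullet above.
pcs constraint ticket add ticketA group1 loss-policy=fence
```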