On 04/22/2017 09:20 AM, Digimer wrote:
> On 22/04/17 03:05 AM, Andrei Borzenkov wrote:
>> 18.04.2017 10:47, Ulrich Windl wrote:
>> ...
>>>> Now let me come back to quorum vs. stonith;
>>>>
>>>> Said simply; Quorum is a tool for when everything is working. Fencing is
>>>> a tool for when things go wrong.
>>> I'd say: Quorum is the tool to decide who'll be alive and who's going to
>>> die,
>>> and STONITH is the tool to make nodes die.
>> If I had PROD, QA and DEV in a cluster and PROD were separated from
>> QA+DEV I'd be very sad if PROD were shut down.
>>
>> The notion of simple node majority as kill policy is not appropriate as
>> well as simple node based delays. I wish pacemaker supported scoring
>> system for resources so that we could base stonith delays on them (the
>> most important sub-cluster starts fencing first).
>>
>>
>>> If everything is working you need
>>> neither quorum nor STONITH.
>>>
>> I wonder how SBD fits into this discussion. It is marketed as stonith
>> agent, but it is based on committing suicide so relies on well-behaving
>> nodes. Which we by definition cannot trust to behave well, otherwise
>> we'd not need stonith in the first place.
> The logic, when using a watchdog timer, is that if the node is alive
> enough to kick the watchdog, it's alive enough to not do something dumb
> to the cluster. If it's not able to kick the timer, the watchdog timer
> will reset the machine. This works *if* all resources hang when messages
> stop coming back from the peer (a side effect of corosync's virtual
> synchrony).

In fact, watchdog implementations (meaning the software that kicks the
hardware watchdog) are a little bit smarter, and so is SBD.
By having the watchdog-kicking and the observation code in a simple loop
that is executed periodically, you don't need the 'if it is alive enough
to do the kicking it will behave well' paradigm.
This boils down to keeping the critical part of the code very small. On
top of that, hard-to-control failures that result in any kind of hang
don't bother us, because a hung loop simply stops kicking the watchdog.

>
> So as I understand it, for SBD to be safe, it requires a hardware
> watchdog timer and a properly configured cluster.

Yes, yes and yes ... as important as fencing I would say ;-)

Regards,
Klaus

> --

Klaus Wenninger

Senior Software Engineer, EMEA ENG Openstack Infrastructure

Red Hat

kwenn...@redhat.com

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
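
For illustration only, the check-then-kick loop discussed in this thread can be sketched roughly as follows. This is a toy simulation, not SBD's actual code: `SimulatedWatchdog` and `run_loop` are hypothetical names, time is counted in abstract ticks, and a check that returns False stands in for any failure or hang that keeps the loop from reaching the kick.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class SimulatedWatchdog:
    """Stand-in for a hardware watchdog timer (hypothetical, not /dev/watchdog)."""
    timeout: int          # ticks allowed to pass without a kick
    remaining: int = 0    # ticks left before the timer fires
    fired: bool = False   # True once the "hardware" would have reset the node

    def __post_init__(self) -> None:
        self.remaining = self.timeout

    def kick(self) -> None:
        # Re-arm the timer; once fired, the node is gone and kicks do nothing.
        if not self.fired:
            self.remaining = self.timeout

    def tick(self) -> None:
        # One unit of time passes; when the timer reaches zero the node is reset.
        if self.fired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.fired = True


def run_loop(
    watchdog: SimulatedWatchdog,
    checks: Iterable[Callable[[int], bool]],
    ticks: int,
) -> Optional[int]:
    """Run the periodic loop; return the tick at which the node was reset, or None."""
    checks = list(checks)
    for now in range(ticks):
        # Observation and kicking live in the same small loop:
        # the watchdog is kicked only if every check passes this round.
        if all(check(now) for check in checks):
            watchdog.kick()
        watchdog.tick()
        if watchdog.fired:
            return now
    return None


# A node whose checks always pass is never reset.
healthy = run_loop(SimulatedWatchdog(timeout=3), [lambda now: True], ticks=10)
# A node whose checks start failing at tick 4 is reset once the timeout expires.
failing = run_loop(SimulatedWatchdog(timeout=3), [lambda now: now < 4], ticks=10)
print(healthy, failing)
```

The point of the sketch is that the only code which must keep working is the small loop itself. If it stops running for any reason, nothing kicks the timer and the watchdog resets the node without further cooperation from it.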