On Thu, 31 May 2018 22:52:12 +0300 Andrei Borzenkov <arvidj...@gmail.com> wrote:
> 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> > Sorry for getting back to you so late.
> >
> > On Fri, 25 May 2018 11:58:59 -0600
> > Casey & Gina <caseyandg...@icloud.com> wrote:
> >
> >>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe <caseyandg...@icloud.com>
> >>> wrote:
> >>>> Actually, why is Pacemaker fencing the standby node just because a
> >>>> resource fails to start there? I thought only the master should be
> >>>> fenced if it were assumed to be broken.
> >>
> >> This is probably the most important thing to ask outside of the PAF
> >> resource agent which many may not be as fluent with as pacemaker itself,
> >> and perhaps the most indicative of me setting something up incorrectly
> >> outside of that resource agent.
> >>
> >> My understanding of fencing was that pacemaker would only fence a node if
> >> it was the master but had stopped responding, to avoid a split-brain
> >> situation. Why would pacemaker ever fence a standby node with no resources
> >> currently allocated to it?
> >
> > So, as discussed on IRC and for the mailing list history, here is the
> > answer:
> >
> > https://clusterlabs.github.io/PAF/administration.html#failover
> >
> > In short: after a failure (either on a primary or a standby), you MUST fix
> > things on the node before starting Pacemaker.
> >
> > If you don't, PAF will detect something incoherent and raise an error,
> > leading Pacemaker to most likely fence your node, again.
> >
>
> Well, that does not sound very polite to user :)

Sure :)

But at least it's been documented, as you pointed out earlier.

After a failure and an automatic failover, either you have some automatic
failback process somewhere... or you have to fix things on the node yourself.
PAF is not able to do automatic failback.

> Another database RA I mentioned somewhere in this thread has different
> approach - it starts database in its monitor action and start action is
> effectively dummy.

Mh, I would have to study that. But I'm not thrilled about such behavior at
first look.

> So start always succeeds from pacemaker point of
> view, but database won't be started until manually synchronized again by
> administrator.

It seems scary... What about the stop action? What if the monitor detects an
error? Well, I really should check the RA you are talking about to answer my
own questions.

> Downside is that pacemaker resource status does not reflect database
> status. I wish pacemaker supported something like "requires manual
> intervention" resource state that would not be treated like error
> (causing all sorts of fatal consequences) but still evaluated for
> dependencies (i.e. dependent resources would not be started). That would
> be ideal for such case.

Good idea. I have a couple more:

* handling errors from notify actions
* supporting migrate-to/from for multistate RA
* having real infinite master score :)

Cheers,