Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
> Well, that does not sound very polite to user :)

The thing that really threw me off was Pacemaker rebooting the node as soon as I tried to start the cluster on it without the database running. Is there a way to prevent this from happening? Some way to indicate to Pacemaker, "Hey, I'm not willing/able to start the resource here because it appears to be in a corrupt state", while not causing the node to be fenced because Pacemaker thinks that the resource is running when it isn't? It would be perfectly safe to not fence the node in this case...

-- 
Casey
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
On Thu, 2018-05-31 at 22:43 +0200, Jehan-Guillaume de Rorthais wrote:
> On Thu, 31 May 2018 22:52:12 +0300
> Andrei Borzenkov wrote:
>
> > 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> > > Sorry for getting back to you so late.
> > >
> > > On Fri, 25 May 2018 11:58:59 -0600
> > > Casey & Gina wrote:
> > >
> > > > > On May 25, 2018, at 7:01 AM, Casey Allen Shobe wrote:
> > > > > > Actually, why is Pacemaker fencing the standby node just
> > > > > > because a resource fails to start there? I thought only the
> > > > > > master should be fenced if it were assumed to be broken.
> > > >
> > > > This is probably the most important thing to ask outside of the
> > > > PAF resource agent which many may not be as fluent with as
> > > > pacemaker itself, and perhaps the most indicative of me setting
> > > > something up incorrectly outside of that resource agent.
> > > >
> > > > My understanding of fencing was that pacemaker would only fence
> > > > a node if it was the master but had stopped responding, to avoid
> > > > a split-brain situation. Why would pacemaker ever fence a
> > > > standby node with no resources currently allocated to it?
> > >
> > > So, as discussed on IRC and for the mailing list history, here is
> > > the answer:
> > >
> > > https://clusterlabs.github.io/PAF/administration.html#failover
> > >
> > > In short: after a failure (either on a primary or a standby), you
> > > MUST fix things on the node before starting Pacemaker.
> > >
> > > If you don't, PAF will detect something incoherent and raise an
> > > error, leading Pacemaker to most likely fence your node, again.
> >
> > Well, that does not sound very polite to user :)
>
> Sure :)
>
> But at least it's been documented, as you pointed out earlier.
>
> After a failure and an automatic failover, either you have some
> automatic failback process somewhere... or you have to fix some
> things around.
>
> PAF is not able to do automatic failback.
>
> > Another database RA I mentioned somewhere in this thread has
> > different approach - it starts database in its monitor action and
> > start action is effectively dummy.
>
> Mh, I would have to study that. But I'm not thrilled about such
> behavior at first look.
>
> > So start always succeeds from pacemaker point of view, but database
> > won't be started until manually synchronized again by administrator.
>
> It seems scary... What about the stop action? What if the monitor
> detects an error? Well, I really should check this RA you are talking
> about to answer my questions.
>
> > Downside is that pacemaker resource status does not reflect
> > database status. I wish pacemaker supported something like
> > "requires manual intervention" resource state that would not be
> > treated like error (causing all sorts of fatal consequences) but
> > still evaluated for dependencies (i.e. dependent resources would
> > not be started). That would be ideal for such case.

I'm not clear what such a result would mean. Is the goal to stop dependent resources, but not the resource itself? And/or to block all further management of the resource?

> Good idea.
>
> I have a couple more:
> * handling errors from notify actions

I could imagine notify supporting on-fail, defaulting to ignore. Would that do what you want? Should notify errors count toward the resource fail count?

> * supporting migrate-to/from for multistate RA
> * having real infinite master score :)

What behavior isn't supported by current infinity?

> Cheers,
-- 
Ken Gaillot
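For readers following the "infinite score" question: Pacemaker documents INFINITY as 1,000,000, with the rules "any value + INFINITY = INFINITY", "any value - INFINITY = -INFINITY" and "INFINITY - INFINITY = -INFINITY". The Python sketch below models only that arithmetic; the function name add_scores is made up for illustration, and it is an assumption that this saturation is related to the "real infinite master score" wish, since the thread does not say.

```python
INFINITY = 1_000_000  # value documented for Pacemaker scores


def add_scores(a: int, b: int) -> int:
    """Illustrative model of Pacemaker score addition (not Pacemaker code)."""
    # -INFINITY wins over everything, including +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Finite sums saturate at the +/-INFINITY bounds.
    return max(-INFINITY, min(INFINITY, a + b))


print(add_scores(1000, 1001))           # ordinary finite sum
print(add_scores(999_999, 5))           # saturates at INFINITY
print(add_scores(INFINITY, -INFINITY))  # -INFINITY wins
```

One consequence of this model is that large finite scores become indistinguishable from INFINITY once their sum reaches 1,000,000.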
Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
On Thu, 31 May 2018 22:52:12 +0300
Andrei Borzenkov wrote:

> 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> > Sorry for getting back to you so late.
> >
> > On Fri, 25 May 2018 11:58:59 -0600
> > Casey & Gina wrote:
> >
> >>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe wrote:
> >>>> Actually, why is Pacemaker fencing the standby node just because a
> >>>> resource fails to start there? I thought only the master should be
> >>>> fenced if it were assumed to be broken.
> >>
> >> This is probably the most important thing to ask outside of the PAF
> >> resource agent which many may not be as fluent with as pacemaker itself,
> >> and perhaps the most indicative of me setting something up incorrectly
> >> outside of that resource agent.
> >>
> >> My understanding of fencing was that pacemaker would only fence a node if
> >> it was the master but had stopped responding, to avoid a split-brain
> >> situation. Why would pacemaker ever fence a standby node with no resources
> >> currently allocated to it?
> >
> > So, as discussed on IRC and for the mailing list history, here is the
> > answer:
> >
> > https://clusterlabs.github.io/PAF/administration.html#failover
> >
> > In short: after a failure (either on a primary or a standby), you MUST fix
> > things on the node before starting Pacemaker.
> >
> > If you don't, PAF will detect something incoherent and raise an error,
> > leading Pacemaker to most likely fence your node, again.
>
> Well, that does not sound very polite to user :)

Sure :)

But at least it's been documented, as you pointed out earlier.

After a failure and an automatic failover, either you have some automatic failback process somewhere... or you have to fix some things around.

PAF is not able to do automatic failback.

> Another database RA I mentioned somewhere in this thread has different
> approach - it starts database in its monitor action and start action is
> effectively dummy.

Mh, I would have to study that. But I'm not thrilled about such behavior at first look.

> So start always succeeds from pacemaker point of
> view, but database won't be started until manually synchronized again by
> administrator.

It seems scary... What about the stop action? What if the monitor detects an error? Well, I really should check this RA you are talking about to answer my questions.

> Downside is that pacemaker resource status does not reflect database
> status. I wish pacemaker supported something like "requires manual
> intervention" resource state that would not be treated like error
> (causing all sorts of fatal consequences) but still evaluated for
> dependencies (i.e. dependent resources would not be started). That would
> be ideal for such case.

Good idea.

I have a couple more:
* handling errors from notify actions
* supporting migrate-to/from for multistate RA
* having real infinite master score :)

Cheers,
Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> Sorry for getting back to you so late.
>
> On Fri, 25 May 2018 11:58:59 -0600
> Casey & Gina wrote:
>
>>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe wrote:
>>>> Actually, why is Pacemaker fencing the standby node just because a
>>>> resource fails to start there? I thought only the master should be
>>>> fenced if it were assumed to be broken.
>>
>> This is probably the most important thing to ask outside of the PAF resource
>> agent which many may not be as fluent with as pacemaker itself, and perhaps
>> the most indicative of me setting something up incorrectly outside of that
>> resource agent.
>>
>> My understanding of fencing was that pacemaker would only fence a node if it
>> was the master but had stopped responding, to avoid a split-brain situation.
>> Why would pacemaker ever fence a standby node with no resources currently
>> allocated to it?
>
> So, as discussed on IRC and for the mailing list history, here is the answer:
>
> https://clusterlabs.github.io/PAF/administration.html#failover
>
> In short: after a failure (either on a primary or a standby), you MUST fix
> things on the node before starting Pacemaker.
>
> If you don't, PAF will detect something incoherent and raise an error, leading
> Pacemaker to most likely fence your node, again.

Well, that does not sound very polite to user :)

Another database RA I mentioned somewhere in this thread has a different approach: it starts the database in its monitor action, and its start action is effectively a dummy. So start always succeeds from Pacemaker's point of view, but the database won't be started until it has been manually synchronized again by the administrator. The downside is that the Pacemaker resource status does not reflect the database status.

I wish Pacemaker supported something like a "requires manual intervention" resource state that would not be treated like an error (causing all sorts of fatal consequences) but would still be evaluated for dependencies (i.e. dependent resources would not be started). That would be ideal for such a case.
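The "dummy start, real work in monitor" design described above can be sketched as a toy model. This is only an illustration in Python, not the code of the resource agent mentioned in this thread: the class DeferredStartAgent, its needs_manual_sync flag and the in-memory db_running state are all hypothetical, and it is an assumption that monitor keeps reporting success while the database is down. Only the return code 0 follows the OCF convention for success.

```python
OCF_SUCCESS = 0  # standard OCF return code for a successful action


class DeferredStartAgent:
    """Hypothetical agent: start is a no-op, monitor launches the database."""

    def __init__(self, needs_manual_sync: bool) -> None:
        self.needs_manual_sync = needs_manual_sync  # hypothetical flag
        self.db_running = False  # stands in for the real database process

    def start(self) -> int:
        # Dummy start: always succeeds, so the cluster never sees a start failure.
        return OCF_SUCCESS

    def monitor(self) -> int:
        # Launch the database only once the administrator has resynchronized it.
        if not self.db_running and not self.needs_manual_sync:
            self.db_running = True
        # Assumed behavior: success is reported either way, which is why the
        # cluster's view of the resource can disagree with the database state.
        return OCF_SUCCESS

    def stop(self) -> int:
        self.db_running = False
        return OCF_SUCCESS


agent = DeferredStartAgent(needs_manual_sync=True)
print(agent.start(), agent.monitor(), agent.db_running)
agent.needs_manual_sync = False  # the administrator has fixed the node
print(agent.monitor(), agent.db_running)
```

In this model the first monitor call reports success while db_running is still False, which is exactly the mismatch between resource status and database status raised as the downside.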
Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
Sorry for getting back to you so late.

On Fri, 25 May 2018 11:58:59 -0600
Casey & Gina wrote:

> > On May 25, 2018, at 7:01 AM, Casey Allen Shobe wrote:
> >> Actually, why is Pacemaker fencing the standby node just because a
> >> resource fails to start there? I thought only the master should be fenced
> >> if it were assumed to be broken.
>
> This is probably the most important thing to ask outside of the PAF resource
> agent which many may not be as fluent with as pacemaker itself, and perhaps
> the most indicative of me setting something up incorrectly outside of that
> resource agent.
>
> My understanding of fencing was that pacemaker would only fence a node if it
> was the master but had stopped responding, to avoid a split-brain situation.
> Why would pacemaker ever fence a standby node with no resources currently
> allocated to it?

So, as discussed on IRC and for the mailing list history, here is the answer:

https://clusterlabs.github.io/PAF/administration.html#failover

In short: after a failure (either on a primary or a standby), you MUST fix things on the node before starting Pacemaker.

If you don't, PAF will detect something incoherent and raise an error, leading Pacemaker to most likely fence your node, again.

For instance, after a primary crash, you will have to resync it as a standby of the new master before starting Pacemaker on the node and handing control back to PAF. This is really important if you don't want to end up with a silently corrupted standby in your cluster.

Cheers,
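As background on how an error raised by an agent can end in fencing: Pacemaker gives each operation an on-fail policy. A failed stop defaults to "fence" when STONITH is enabled (and "block" otherwise), while other operations default to "restart", and that recovery begins with a stop. The Python sketch below is a simplified model of that escalation, not Pacemaker code; the function names are made up for illustration, and it is an assumption that this is the exact path hit in this thread.

```python
def default_on_fail(operation: str, stonith_enabled: bool) -> str:
    """Documented Pacemaker on-fail defaults, in simplified form."""
    if operation == "stop":
        return "fence" if stonith_enabled else "block"
    return "restart"


def recovery_outcome(failed_op: str, stop_succeeds: bool,
                     stonith_enabled: bool = True) -> str:
    """Hypothetical helper: follow the escalation after one failed operation."""
    policy = default_on_fail(failed_op, stonith_enabled)
    if policy != "restart":
        return policy
    # "restart" begins by stopping the resource; if that stop fails too,
    # the stop operation's own policy applies.
    if stop_succeeds:
        return "restart"
    return default_on_fail("stop", stonith_enabled)


print(recovery_outcome("monitor", stop_succeeds=True))
print(recovery_outcome("monitor", stop_succeeds=False))
print(recovery_outcome("monitor", stop_succeeds=False, stonith_enabled=False))
```

Under these assumptions, a node holding no healthy resource can still be fenced: the cluster only needs to believe the resource may be active there and then fail to stop it cleanly.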
[ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)
> On May 25, 2018, at 7:01 AM, Casey Allen Shobe wrote:
>
>> Actually, why is Pacemaker fencing the standby node just because a resource
>> fails to start there? I thought only the master should be fenced if it were
>> assumed to be broken.

This is probably the most important thing to ask outside of the PAF resource agent which many may not be as fluent with as pacemaker itself, and perhaps the most indicative of me setting something up incorrectly outside of that resource agent.

My understanding of fencing was that pacemaker would only fence a node if it was the master but had stopped responding, to avoid a split-brain situation. Why would pacemaker ever fence a standby node with no resources currently allocated to it?

Regards,
-- 
Casey