On Fri, 2021-11-05 at 11:22 +0300, Andrei Borzenkov wrote:
> On 05.11.2021 01:20, Ken Gaillot wrote:
> > > There are two issues discussed in this thread.
> > >
> > > 1. Remote node is fenced when connection with this node is lost.
> > > For all I can tell this is intended and expected behavior. That
> > > was the original question.
> >
> > It's expected only because the connection can't be recovered
> > elsewhere. If another node can run the connection, pacemaker will
> > try to reconnect from there and re-probe everything to make sure
> > what the current state is.
> >
>
> That's not what I see in sources and documentation and not what I
> observe. Pacemaker will reprobe from another node only after
> attempting fencing of remote node.

Ah, you're right, I misremembered. Probe/start failures of a remote
connection don't require fencing, but recurring monitor failures do. I
guess that makes sense; otherwise recovery of resources on a failed
remote could be greatly delayed.

I was confusing that with the case where the connection host is lost and
has to be fenced. In that case the connection will be recovered
elsewhere if possible, without fencing the remote.

<snip>

> The difference seems to be reconnect_interval parameter. If it is
> present in remote resource definition, pacemaker will not proceed
> after failed fencing.
>
> As there is no real documentation how it is supposed to work I do not
> know whether all of this is a bug or not. But one is certainly sure -
> when connection to remote node is lost the first thing pacemaker does
> is to fence it and only then initiate any recovery action.

reconnect_interval is implemented as a sort of special case of
failure-timeout. When the interval expires, the connection failure is
timed out, so the cluster no longer sees a need for fencing. It's not a
bug, but it may be a questionable design.

That's a case of a broader problem: if the cause for fencing goes away,
the cluster will stop trying to fence and act as if nothing was wrong.
This can be a good thing; for example, a brief network interruption can
sometimes heal without fencing. However, it's been suggested (e.g.
CLBZ#5476) that we need the concept of fencing that is required
independently of conditions. That is, for certain types of failure,
fencing should be considered required until it succeeds, regardless of
whether the original need for it goes away.

-- 
Ken Gaillot <kgail...@redhat.com>
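
To make the reconnect_interval behaviour described above a bit more concrete, here is a toy model in Python. It is only a sketch of the idea and not Pacemaker code: the RemoteConnection class, the fencing_needed() function and the "sticky" flag are all hypothetical names invented for illustration, and the assumption that a failure never expires when no interval is set is mine. The "sticky" flag stands in for the suggested "fencing required until it succeeds" concept.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteConnection:
    """Toy stand-in for a remote connection resource (not a Pacemaker type)."""
    reconnect_interval: Optional[float] = None  # seconds; None means unset
    failed_at: Optional[float] = None           # time the monitor failure was recorded
    fenced: bool = False                        # True once fencing has succeeded


def fencing_needed(conn: RemoteConnection, now: float, sticky: bool = False) -> bool:
    """Return True if this toy model still wants the remote node fenced."""
    if conn.failed_at is None or conn.fenced:
        # No recorded failure, or fencing already succeeded.
        return False
    if sticky:
        # Hypothetical policy: fencing stays required until it succeeds,
        # even if the original failure has since timed out.
        return True
    if conn.reconnect_interval is None:
        # Assumption: without an interval the failure never expires.
        return True
    # The failure is treated as timed out once the interval has elapsed,
    # so the need for fencing disappears with it.
    return now - conn.failed_at < conn.reconnect_interval
```

With reconnect_interval=60 and a failure recorded at time 0, this model wants fencing at time 30 but no longer at time 61, unless sticky=True is passed, in which case it keeps wanting fencing until fenced is set.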