>>> Harvey Shepherd <harvey.sheph...@aviatnet.com> wrote on 12.08.2019 at
23:38 in message <ec767e3d-0cde-42c2-a8de-72ffce859...@email.android.com>:
> I've been experiencing exactly the same issue. Pacemaker prioritises
> restarting the failed resource over maintaining a master instance. In my
> case I used crm_simulate to analyse the actions planned and taken by
> Pacemaker during resource recovery. It showed that the system did plan to
> fail over the master instance, but that action was near the bottom of the
> action list. Higher priority was given to restarting the failed instance,
> and once that restart had happened, it was easier to just promote the same
> instance again rather than to fail over.
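(As an aside, for anyone who wants to reproduce that kind of analysis:
something like the following should print the allocation scores and the
planned action list. The resource and node names below are made up.)

  # Show scores and the planned transition for the live cluster:
  crm_simulate -sL

  # Simulate the cluster's reaction to a failed monitor (rc=1) of a
  # made-up resource "p_foo" on "node1", without touching the cluster:
  crm_simulate -SL --op-inject=p_foo_monitor_10000@node1=1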
That's interesting. Maybe it is usually actually faster to restart a failed
(master) process than to promote a slave to master, possibly demote the old
master to slave, etc. But notably, while there is a (possible) utilization
for resources, there is none for operations (AFAIK): if one could configure
"operation costs" (maybe as rules), the cluster could prefer the transition
with the lowest total cost. Unfortunately that would make things more
complicated. I could even imagine that if you set the cost for "stop" to
infinity, the cluster would not even try to stop the resource, but would
fence the node instead...

> This particular behaviour caused me a lot of headaches. In the end I had
> to use a workaround: setting the maximum failure count for the resource
> to 1 and clearing the failure after 10 seconds. This forces a failover,
> but there is then a window (longer than 10 seconds, due to the cluster
> recheck timer that is used to clear failures) during which the resource
> cannot fail back if a second failure happens. It also means that no slave
> is running during this time, which causes a performance hit in my case.
> [...]
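For reference, I read "max failures = 1, clear after 10 seconds" as the
migration-threshold and failure-timeout meta attributes. A minimal crmsh
sketch, with made-up resource names (ocf:pacemaker:Stateful only as a
stand-in for the real agent):

  # migration-threshold=1 forces a move after a single failure;
  # failure-timeout=10s lets the failcount expire again afterwards.
  crm configure primitive p_foo ocf:pacemaker:Stateful \
        op monitor interval=10s role=Master \
        op monitor interval=11s role=Slave \
        meta migration-threshold=1 failure-timeout=10s
  crm configure ms ms_foo p_foo meta master-max=1 clone-max=2

  # The failcount is only expired when the policy engine next runs, so
  # the effective window also depends on this cluster property:
  crm configure property cluster-recheck-interval=60s

That would also explain the window Harvey describes: failure-timeout is only
acted on at the next recheck, not exactly 10 seconds after the failure.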