On 2021-01-26 11:27 a.m., Ken Gaillot wrote:
> On Tue, 2021-01-26 at 11:03 -0500, Digimer wrote:
>> On 2021-01-26 10:15 a.m., Tomas Jelinek wrote:
>>> Dne 25. 01. 21 v 17:01 Ken Gaillot napsal(a):
>>>> On Mon, 2021-01-25 at 09:51 +0100, Jehan-Guillaume de Rorthais
>>>> wrote:
>>>>> Hi Digimer,
>>>>>
>>>>> On Sun, 24 Jan 2021 15:31:22 -0500
>>>>> Digimer <li...@alteeve.ca> wrote:
>>>>> [...]
>>>>>> I had a test server (srv01-test) running on node 1 (el8-a01n01),
>>>>>> and on node 2 (el8-a01n02) I ran 'pcs cluster stop --all'.
>>>>>>
>>>>>> It appears that pacemaker asked the VM to migrate to node 2
>>>>>> instead of stopping it. Once the server was on node 2, I couldn't
>>>>>> use 'pcs resource disable <vm>' as it returned that the resource
>>>>>> was unmanaged, and the cluster shutdown hung. When I stopped the
>>>>>> VM directly and then did a 'pcs resource cleanup', the cluster
>>>>>> shutdown completed.
>>>>>
>>>>> As actions during a cluster shutdown cannot be handled in the same
>>>>> transition for each node, I usually add a step to disable all
>>>>> resources, using the property "stop-all-resources", before
>>>>> shutting down the cluster:
>>>>>
>>>>>   pcs property set stop-all-resources=true
>>>>>   pcs cluster stop --all
>>>>>
>>>>> But it seems there's a very new cluster property to handle that
>>>>> (IIRC, one or two releases ago). Look at the "shutdown-lock" doc:
>>>>>
>>>>> [...] some users prefer to make resources highly available only
>>>>> for failures, with no recovery for clean shutdowns. If this option
>>>>> is true, resources active on a node when it is cleanly shut down
>>>>> are kept "locked" to that node (not allowed to run elsewhere)
>>>>> until they start again on that node after it rejoins (or for at
>>>>> most shutdown-lock-limit, if set). [...]
>>>>>
>>>>> [...]
>>>>>> So as best as I can tell, pacemaker really did ask for a
>>>>>> migration. Is this the case?
>>>>>
>>>>> AFAIK, yes, because each cluster shutdown request is handled
>>>>> independently at the node level. That leaves the door wide open to
>>>>> all kinds of race conditions if the requests are handled with some
>>>>> random lag on each node.
>>>>
>>>> I'm going to guess that's what happened.
>>>>
>>>> The basic issue is that there is no "cluster shutdown" in
>>>> Pacemaker, only "node shutdown". I'm guessing "pcs cluster stop
>>>> --all" sends a shutdown request to each node in sequence (probably
>>>> via systemd), and if the nodes are quick enough, one could start
>>>> migrating resources off before all the others get their shutdown
>>>> request.
>>>
>>> Pcs does its best to stop nodes in parallel. The first
>>> implementation of this was done back in 2015:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1180506
>>> Since then, we have moved to using curl for network communication,
>>> which also handles parallel cluster stop. Obviously, this doesn't
>>> ensure the stop command arrives at, and is processed on, all nodes
>>> at exactly the same time.
>>>
>>> Basically, pcs sends a 'stop pacemaker' request to all nodes in
>>> parallel and waits for it to finish on all nodes. Then it sends a
>>> 'stop corosync' request to all nodes in parallel. The actual
>>> stopping on each node is done by 'systemctl stop'.
>>>
>>> Yes, the nodes which get the request sooner may start migrating
>>> resources.
>>>
>>> Regards,
>>> Tomas
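
To make Tomas's description concrete, here is a rough shell sketch of
that sequence (illustration only: pcs actually talks to the pcsd daemon
over its curl-based protocol rather than ssh; the node names are the
ones from the original report):

  # Sketch only -- pcs uses pcsd/curl, not ssh; error handling omitted.
  for node in el8-a01n01 el8-a01n02; do
      ssh "$node" 'systemctl stop pacemaker' &   # fire requests in parallel
  done
  wait                                           # pacemaker stopped everywhere
  for node in el8-a01n01 el8-a01n02; do
      ssh "$node" 'systemctl stop corosync' &    # then corosync, in parallel
  done
  wait

Even fired in parallel, each node starts shutting down whenever its own
request lands, so the first node to react can still try to migrate
resources to peers that have not yet begun their own shutdown.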
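
And for the shutdown-lock suggestion from Jehan-Guillaume earlier in the
thread, enabling it looks something like this (the property was added in
Pacemaker 2.0.4, which fits "one or two releases ago"; the 30-minute
limit is just an example value):

  # Keep resources locked to a cleanly shut-down node instead of
  # recovering them elsewhere:
  pcs property set shutdown-lock=true
  # Optional: release the lock if the node isn't back within 30 minutes:
  pcs property set shutdown-lock-limit=30min
  pcs cluster stop --all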
>>
>> Given the case I had, where a resource went unmanaged and the stop
>> hung indefinitely, would that be considered a bug?
>
> That depends on why. You'll have to check the logs around that time to
> see if there are any details. It would be considered appropriate if,
> e.g., an action with on-fail=block failed.
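
(For reference: on-fail is an operation option, so the case Ken
describes would come from a configuration along these lines, shown here
with the VM resource name from the original report purely as a
hypothetical. With on-fail=block, a failed stop leaves the resource
blocked and unmanaged instead of fencing the node, which would look
exactly like the hang above.)

  # Hypothetical example -- not taken from the reporter's actual config:
  pcs resource update srv01-test op stop on-fail=block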
OK, I'll try to reproduce and, if I can, post the logs.

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould