On 2021-01-25 3:58 p.m., Ken Gaillot wrote:
> On Mon, 2021-01-25 at 13:18 -0500, Digimer wrote:
>> On 2021-01-25 11:01 a.m., Ken Gaillot wrote:
>>> On Mon, 2021-01-25 at 09:51 +0100, Jehan-Guillaume de Rorthais
>>> wrote:
>>>> Hi Digimer,
>>>>
>>>> On Sun, 24 Jan 2021 15:31:22 -0500
>>>> Digimer <li...@alteeve.ca> wrote:
>>>> [...]
>>>>> I had a test server (srv01-test) running on node 1 (el8-a01n01),
>>>>> and on node 2 (el8-a01n02) I ran 'pcs cluster stop --all'.
>>>>>
>>>>> It appears like pacemaker asked the VM to migrate to node 2
>>>>> instead of stopping it. Once the server was on node 2, I couldn't
>>>>> use 'pcs resource disable <vm>' as it returned that the resource
>>>>> was unmanaged, and the cluster shutdown was hung. When I directly
>>>>> stopped the VM and then did a 'pcs resource cleanup', the cluster
>>>>> shutdown completed.
>>>>
>>>> As actions during a cluster shutdown cannot be handled in the same
>>>> transition for each node, I usually add a step to disable all
>>>> resources using the property "stop-all-resources" before shutting
>>>> down the cluster:
>>>>
>>>>   pcs property set stop-all-resources=true
>>>>   pcs cluster stop --all
>>>>
>>>> But it seems there's a very new cluster property to handle that
>>>> (IIRC, one or two releases ago). Look at the "shutdown-lock" doc:
>>>>
>>>> [...]
>>>> some users prefer to make resources highly available only for
>>>> failures, with no recovery for clean shutdowns. If this option is
>>>> true, resources active on a node when it is cleanly shut down are
>>>> kept "locked" to that node (not allowed to run elsewhere) until
>>>> they start again on that node after it rejoins (or for at most
>>>> shutdown-lock-limit, if set).
>>>> [...]
>>>>
>>>> [...]
>>>>> So as best as I can tell, pacemaker really did ask for a
>>>>> migration. Is this the case?
>>>>
>>>> AFAIK, yes, because each cluster shutdown request is handled
>>>> independently at the node level. There's a large door open for all
>>>> kinds of race conditions if requests are handled with some random
>>>> lag on each node.
>>>
>>> I'm going to guess that's what happened.
>>>
>>> The basic issue is that there is no "cluster shutdown" in Pacemaker,
>>> only "node shutdown". I'm guessing "pcs cluster stop --all" sends
>>> shutdown requests for each node in sequence (probably via systemd),
>>> and if the nodes are quick enough, one could start migrating off
>>> resources before all the others get their shutdown request.
>>>
>>> There would be a way around it. Normally Pacemaker is shut down via
>>> SIGTERM to pacemakerd (which is what systemctl stop does), but
>>> inside Pacemaker it's implemented as a special "shutdown" transient
>>> node attribute, set to the epoch timestamp of the request. It would
>>> be possible to set that attribute for all nodes in a copy of the
>>> CIB, then load that into the live cluster.
>>>
>>> stop-all-resources as suggested would be another way around it (and
>>> would have to be cleared after start-up, which could be a plus or a
>>> minus depending on how much control vs convenience you want).
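
(As an aside, in case it helps anyone searching the archives: a rough,
untested sketch of the CIB-based approach Ken describes might look like
the commands below. The XML layout and IDs are assumptions from my own
clusters; check 'cibadmin --query' output on yours before trying
anything like this.)

  # Sketch only: stamp the transient "shutdown" attribute on every node
  # in an offline copy of the CIB, then push it back in one update so
  # all nodes see the shutdown request at (roughly) the same time.
  cibadmin --query > /tmp/cib.xml

  # For each <node_state> in the <status> section of /tmp/cib.xml, add
  # something like (the node ID "1" and the epoch value are
  # illustrative, and a <transient_attributes> block may already exist):
  #
  #   <transient_attributes id="1">
  #     <instance_attributes id="status-1">
  #       <nvpair id="status-1-shutdown" name="shutdown"
  #               value="1611606000"/>
  #     </instance_attributes>
  #   </transient_attributes>

  cibadmin --replace --xml-file /tmp/cib.xml
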
>> Thanks for your and everyone else's replies!
>>
>> I'm left curious about one part of this, though: when the node
>> migrated, the resource was then listed as unmanaged. So the resource
>> was never requested to shut down, and the cluster shutdown on that
>> node then hung.
>>
>> I can understand what's happening that triggered the migration, and I
>> can understand how to prevent it in the future. (Truth be told, the
>> Anvil! already would shut down all servers before calling the
>> pacemaker stop, but I wanted to test possible fault conditions.)
>>
>> Is it not a bug that the cluster was unable to stop after the
>> migration?
>>
>> If I understand what's been said in this thread, the host node got a
>> shutdown request, so it migrated the resource. Then the peer (new
>> host) would have gotten the shutdown request; should it then have
>> seen that the peer was gone and shut the resource down? Why did it
>> enter an unmanaged state?
>>
>> Cheers
>
> There aren't many ways the cluster can change a resource to unmanaged:
> maintenance mode configured (on the cluster, node, or resource), a
> failure when on-fail=block, being multiply active with
> multiple-active=block, losing quorum with no-quorum-policy=freeze, or
> a stop failure with no ability to fence.
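
(For anyone else digging into this: a quick way to check which of those
conditions applies might be something like the commands below. They
assume standard pcs/crm_resource tooling and use my 'srv01-test'
resource name; the log path is the EL8 default and may differ
elsewhere.)

  # Cluster-wide properties that can leave resources unmanaged/frozen
  pcs property list --all | grep -E 'maintenance-mode|no-quorum-policy'

  # Per-resource meta attributes (errors out if the attribute isn't set)
  crm_resource --resource srv01-test --meta --get-parameter is-managed
  crm_resource --resource srv01-test --meta --get-parameter multiple-active

  # Failed actions, pending fencing, and inactive resources at a glance
  crm_mon --one-shot --inactive

  # And the detail log itself
  grep -iE 'unmanage|is-managed|maintenance' /var/log/pacemaker/pacemaker.log
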
Sorry, let me clarify: the resource was managed when I called 'pcs
cluster stop --all', so something in the background set it to
'unmanaged'. I suppose I would need to look in the logs...

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/