On Mon, 2021-01-25 at 13:18 -0500, Digimer wrote:
> On 2021-01-25 11:01 a.m., Ken Gaillot wrote:
> > On Mon, 2021-01-25 at 09:51 +0100, Jehan-Guillaume de Rorthais wrote:
> > > Hi Digimer,
> > >
> > > On Sun, 24 Jan 2021 15:31:22 -0500
> > > Digimer <li...@alteeve.ca> wrote:
> > > [...]
> > > > I had a test server (srv01-test) running on node 1 (el8-a01n01),
> > > > and on node 2 (el8-a01n02) I ran 'pcs cluster stop --all'.
> > > >
> > > > It appears like pacemaker asked the VM to migrate to node 2
> > > > instead of stopping it. Once the server was on node 2, I couldn't
> > > > use 'pcs resource disable <vm>' as it returned that the resource
> > > > was unmanaged, and the cluster shutdown was hung. When I directly
> > > > stopped the VM and then did a 'pcs resource cleanup', the cluster
> > > > shutdown completed.
> > >
> > > As actions during a cluster shutdown cannot be handled in the same
> > > transition for each node, I usually add a step to disable all
> > > resources using the property "stop-all-resources" before shutting
> > > down the cluster:
> > >
> > >   pcs property set stop-all-resources=true
> > >   pcs cluster stop --all
> > >
> > > But it seems there's a fairly new cluster property to handle that
> > > (IIRC, one or two releases ago). Look at the "shutdown-lock" doc:
> > >
> > >   [...] some users prefer to make resources highly available only
> > >   for failures, with no recovery for clean shutdowns. If this
> > >   option is true, resources active on a node when it is cleanly
> > >   shut down are kept "locked" to that node (not allowed to run
> > >   elsewhere) until they start again on that node after it rejoins
> > >   (or for at most shutdown-lock-limit, if set). [...]
> > >
> > > [...]
> > > > So as best as I can tell, pacemaker really did ask for a
> > > > migration. Is this the case?
> > >
> > > AFAIK, yes, because each cluster shutdown request is handled
> > > independently at node level. There's a large door open for all
> > > kinds of race conditions if requests are handled with some random
> > > lag on each node.
> >
> > I'm going to guess that's what happened.
> >
> > The basic issue is that there is no "cluster shutdown" in Pacemaker,
> > only "node shutdown". I'm guessing "pcs cluster stop --all" sends
> > shutdown requests for each node in sequence (probably via systemd),
> > and if the nodes are quick enough, one could start migrating off
> > resources before all the others get their shutdown request.
> >
> > There would be a way around it. Normally Pacemaker is shut down via
> > SIGTERM to pacemakerd (which is what systemctl stop does), but inside
> > Pacemaker it's implemented as a special "shutdown" transient node
> > attribute, set to the epoch timestamp of the request. It would be
> > possible to set that attribute for all nodes in a copy of the CIB,
> > then load that into the live cluster.
> >
> > stop-all-resources as suggested would be another way around it (and
> > would have to be cleared after start-up, which could be a plus or a
> > minus depending on how much control vs convenience you want).
>
> Thanks for your and everyone else's replies!
>
> I'm left curious about one part of this though: when the node
> migrated, the resource was then listed as unmanaged. So the resource
> was never requested to shut down, and the cluster shutdown on that
> node then hung.
>
> I can understand what's happening that triggered the migration, and I
> can understand how to prevent it in the future. (Truth be told, the
> Anvil! already would shut down all servers before calling the
> pacemaker stop, but I wanted to test possible fault conditions.)
>
> Is it not a bug that the cluster was unable to stop after the
> migration?
>
> If I understand what's been said in this thread, the host node got a
> shutdown request, so it migrated the resource. Then the peer (new
> host) would have gotten the shutdown request; should it then have
> seen the peer was gone and shut the resource down? Why did it enter
> an unmanaged state?
>
> Cheers
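Just to spell out the workaround quoted above as a rough sketch (the
exact clean-up step may vary by pcs version, so treat this as
illustrative rather than exact syntax):

  # freeze resource management cluster-wide, then shut everything down
  pcs property set stop-all-resources=true
  pcs cluster stop --all

  # after bringing the cluster back up, remember to clear the property
  pcs cluster start --all
  pcs property set stop-all-resources=false

As for why the resource showed up as unmanaged: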
There aren't many ways the cluster can change a resource to unmanaged:
maintenance mode configured (on the cluster, node, or resource), a
failure when on-fail=block, being multiply active with
multiple-active=block, losing quorum with no-quorum-policy=freeze, or
a stop failure with no ability to fence.
-- 
Ken Gaillot <kgail...@redhat.com>
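P.S. If it helps, one quick way to check which of those applies is to
query the relevant options directly. This is only a sketch, assuming a
reasonably recent crm_attribute/crm_resource/pcs; the resource and node
names are the ones from your earlier message:

  # cluster-wide settings
  crm_attribute --query --name maintenance-mode
  crm_attribute --query --name no-quorum-policy

  # per-node maintenance attribute on the new host
  crm_attribute --type nodes --node el8-a01n02 --query --name maintenance

  # per-resource meta-attributes
  crm_resource --resource srv01-test --meta --get-parameter is-managed
  crm_resource --resource srv01-test --meta --get-parameter multiple-active

  # operation options such as on-fail show up in the full resource config
  pcs resource config srv01-test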