On Fri, 2020-02-28 at 09:37 +0100, Ulrich Windl wrote:
> > > > Ken Gaillot <kgail...@redhat.com> schrieb am 27.02.2020 um
> > > > 23:43 in Nachricht
> > <43512a11c2ddffbabeee11cf4cb509e4e5dc98ca.ca...@redhat.com>:
> > [...]
> > > 2. Resources/groups are stopped (target-role=stopped)
> > > 3. Node exits the cluster cleanly when no resources are
> > >    running any more
> > > 4. The node rejoins the cluster after the reboot
> > > 5. Positive (on the rebooted node) & negative (ban on the rest
> > >    of the nodes) constraints are created for the resources
> > >    marked in step 1
> > > 6. target-role is set back to started and the resources are
> > >    back and running
> > > 7. When each resource group (or standalone resource) is back
> > >    online, the mark from step 1 is removed, and any location
> > >    constraints (cli-ban & cli-prefer) are removed for the
> > >    resource/group
> >
> > Exactly, that's effectively what happens.
>
> May I ask how robust the mechanism will be?
> For example, if you do a "resource restart", there are two target
> roles (each made persistent): stopped and started. If the node
> performing the operation is fenced (we had that a few times), the
> resources may remain "stopped" until started manually again.
> I see a similar issue with this mechanism.
Corner cases were carefully considered with this one. If a node is
fenced, its entire CIB status section is cleared, which will include
shutdown locks.

I considered alternative implementations under the hood, and the main
advantage of the one chosen is that setting and clearing the lock are
atomic with recording the action results that cause them. That
eliminates a whole lot of possibilities for the type of problem you
mention.

Also, there are multiple backstops to clear locks if anything is
fishy: for example, if the node is unclean, if the resource somehow
started elsewhere while the lock was in effect, or if a locked
resource is removed from the configuration while it is down.

The one area I don't consider mature yet is Pacemaker Remote nodes;
I'd recommend using the feature only in a cluster without them. This
is due mainly to a (documented) limitation that manual lock clearing
and shutdown-lock-limit only work if the remote connection is
disabled after stopping the node, which sort of defeats the "hands
off" goal. But also, I think using locks with remote nodes requires
more testing.

> > [...]

--
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
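[For readers following along: a minimal sketch of how the feature discussed above is driven from the command line, based on the Pacemaker 2.0.4 documentation. The resource name "myrsc" and node name "node1" are hypothetical placeholders; adapt them to your cluster.]

```shell
# Enable shutdown locks cluster-wide: resources active on a cleanly
# shutting-down node are "locked" to it and will not be recovered
# elsewhere while the node is down.
crm_attribute --type crm_config --name shutdown-lock --update true

# Optionally cap how long a lock is honored; after this duration the
# cluster may recover the resource on another node.
crm_attribute --type crm_config --name shutdown-lock-limit --update 10min

# Manually clear a lock for one resource (both --resource and --node
# must be given, per the documentation), e.g. if the rebooted node
# will be down longer than expected.
crm_resource --refresh --resource myrsc --node node1
```

Note that, as mentioned above, on a fenced node no manual clearing is needed: fencing wipes the node's CIB status section, locks included.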