Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing
On 24/07/19 12:33 -0500, Ken Gaillot wrote:
> A recent bugfix (clbz#5386) brings up a question.
>
> A node may receive notification of its own fencing when fencing is
> misconfigured (for example, an APC switch with the wrong plug number)
> or when fabric fencing is used that doesn't cut the cluster network
> (for example, fence_scsi).

One related idea that would be better thought through at its own pace: whether it would make sense to maximize the benefit of knowing which kind of behaviour to expect from a particular abstracted fencing device. Is it an absolute cut-off of the whole node's activity, or is it just a partial isolation where it presumably matters the most (access to disk, access to network resources, ...)?

Then, a dichotomy in failure modes could be introduced, since these are effectively _different_ disaster-limiting scenarios with different pros and cons (consider also debuggability). I have always had mixed feelings about putting total and partial fencing into the same bucket.

Apparently, that information would need to be propagated via the metadata of the agents, meaning pulling more complexity into that level.

The broader picture might even be that compositions of the resources could also point out which kinds of shared resources are in danger of amplifying the failure, causing split brain, etc., and hence offer feedback on which of these are yet to be covered if the absolute cut-off is not preferred or not available for whatever reason.

/me gets back from daydreaming

-- Poki

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing
On Thu, Jul 25, 2019 at 3:20 AM Ondrej wrote:
>
> Is there any plan on getting this also into 1.1 branch?
> If yes, then I would be for just introducing the configuration option in
> 1.1.x with default to 'stop'.
>

+1 for backporting it, from someone who just recently hit this (puzzling) behavior.
Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing
On 7/25/19 1:33 AM, Ken Gaillot wrote:
> Hi all,
>
> A recent bugfix (clbz#5386) brings up a question.
>
> A node may receive notification of its own fencing when fencing is
> misconfigured (for example, an APC switch with the wrong plug number)
> or when fabric fencing is used that doesn't cut the cluster network
> (for example, fence_scsi).
>
> Previously, the *intended* behavior was for the node to attempt to
> reboot itself in that situation, falling back to stopping pacemaker if
> that failed. However, due to the bug, the reboot always failed, so the
> behavior effectively was to stop pacemaker.
>
> Now that the bug is fixed, the node will indeed reboot in that
> situation.
>
> It occurred to me that some users configure fabric fencing specifically
> so that nodes aren't ever intentionally rebooted. Therefore, I intend
> to make this behavior configurable.
>
> My question is, what do you think the default should be?
>
> 1. Default to the correct behavior (reboot)
>
> 2. Default to the current behavior (stop)
>
> 3. Default to the current behavior for now, and change it to the
> correct behavior whenever pacemaker 2.1 is released (probably a few
> years from now)
>

Sounds like 3) is the best choice.

Make it configurable, and keep the current behavior (stop) for backward compatibility within the current minor version, e.g. the next 2.0.z (3+).

That said, the correct behavior (reboot) should eventually be enforced as the default. It should be treated as just as crucial as a stop failure of a resource. It makes sense to switch in the next minor version, say, 2.1.

Thanks,
Roger
Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing
On 7/25/19 2:33 AM, Ken Gaillot wrote:
> Hi all,
>
> A recent bugfix (clbz#5386) brings up a question.
>
> A node may receive notification of its own fencing when fencing is
> misconfigured (for example, an APC switch with the wrong plug number)
> or when fabric fencing is used that doesn't cut the cluster network
> (for example, fence_scsi).
>
> Previously, the *intended* behavior was for the node to attempt to
> reboot itself in that situation, falling back to stopping pacemaker if
> that failed. However, due to the bug, the reboot always failed, so the
> behavior effectively was to stop pacemaker.
>
> Now that the bug is fixed, the node will indeed reboot in that
> situation.
>
> It occurred to me that some users configure fabric fencing specifically
> so that nodes aren't ever intentionally rebooted. Therefore, I intend
> to make this behavior configurable.
>
> My question is, what do you think the default should be?
>
> 1. Default to the correct behavior (reboot)
>
> 2. Default to the current behavior (stop)
>
> 3. Default to the current behavior for now, and change it to the
> correct behavior whenever pacemaker 2.1 is released (probably a few
> years from now)

As long as there is an option to change it, I'm OK with changing the default to 'reboot' starting with the next minor(?) version (2.0.3). But it should be pointed out at the RC stage that this is going to happen, so that users can get ready for it.

Is there any plan on getting this also into the 1.1 branch? If yes, then I would be for just introducing the configuration option in 1.1.x with the default set to 'stop'.

-- Ondrej
[ClusterLabs] Feedback wanted: Node reaction to fabric fencing
Hi all,

A recent bugfix (clbz#5386) brings up a question.

A node may receive notification of its own fencing when fencing is misconfigured (for example, an APC switch with the wrong plug number) or when fabric fencing is used that doesn't cut the cluster network (for example, fence_scsi).

Previously, the *intended* behavior was for the node to attempt to reboot itself in that situation, falling back to stopping pacemaker if that failed. However, due to the bug, the reboot always failed, so the behavior effectively was to stop pacemaker.

Now that the bug is fixed, the node will indeed reboot in that situation.

It occurred to me that some users configure fabric fencing specifically so that nodes aren't ever intentionally rebooted. Therefore, I intend to make this behavior configurable.

My question is, what do you think the default should be?

1. Default to the correct behavior (reboot)

2. Default to the current behavior (stop)

3. Default to the current behavior for now, and change it to the correct behavior whenever pacemaker 2.1 is released (probably a few years from now)

-- Ken Gaillot
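
For readers following the thread, the reaction described in the original message (attempt a reboot, fall back to stopping pacemaker, with the choice made configurable) might be modeled roughly as below. This is only an illustrative sketch, not Pacemaker's actual code: the names FenceReaction, react_to_own_fencing, reboot and stop_cluster are hypothetical and were invented here, and the thread does not settle on a real option name or its values.

```python
from enum import Enum
from typing import Callable


class FenceReaction(Enum):
    """Hypothetical setting: what a node does when told it was fenced."""
    REBOOT = "reboot"
    STOP = "stop"


def react_to_own_fencing(
    reaction: FenceReaction,
    reboot: Callable[[], bool],
    stop_cluster: Callable[[], None],
) -> str:
    """Apply the configured reaction and return the action actually taken.

    `reboot` is assumed to return False if the reboot attempt fails;
    `stop_cluster` stands in for stopping the cluster software locally.
    """
    if reaction is FenceReaction.REBOOT and reboot():
        return "reboot"
    # Either "stop" was configured, or the reboot attempt failed:
    # fall back to stopping the cluster software.
    stop_cluster()
    return "stop"
```

In this model, the behavior before the bugfix corresponds to a reboot callable that always returns False, which is why nodes effectively always ended up in the "stop" branch.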