Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing

2019-07-25 Thread Jan Pokorný
On 24/07/19 12:33 -0500, Ken Gaillot wrote:
> A recent bugfix (clbz#5386) brings up a question.
> 
> A node may receive notification of its own fencing when fencing is
> misconfigured (for example, an APC switch with the wrong plug number)
> or when fabric fencing is used that doesn't cut the cluster network
> (for example, fence_scsi).

One related idea, better thought through at its own pace, is
whether it would make sense to take advantage of knowing
which kind of behaviour to expect from a particular abstracted
fencing device.  Is it an absolute cut-off of everything the node
does, or just a partial isolation where it presumably matters
the most (access to disk, access to network resources, ...)?
A distinction between failure modes could then be introduced, since
these are effectively _different_ disaster-limiting scenarios
with different pros and cons (consider also debuggability).
I have always had mixed feelings about putting total and partial
fencing into the same bucket.  That information would presumably
need to be propagated via the agents' metadata, which means pushing
more complexity down to that level.

In the broader picture, the composition of the configured resources
could also indicate which kinds of shared resources are in
danger of amplifying a failure, causing split brain, etc., and
hence give feedback on which of these still need to be covered
when an absolute cut-off is not preferred or not available for
whatever reason.
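
To make the idea above a bit more concrete, here is a minimal Python
sketch of what such feedback could look like. Everything in it is
hypothetical: IsolationScope, FenceDeviceInfo, uncovered_resources and
the resource kind labels are invented for illustration and do not
correspond to any existing fence agent metadata or Pacemaker API.

```python
from dataclasses import dataclass
from enum import Enum


class IsolationScope(Enum):
    TOTAL = "total"      # the whole node is cut off (hypothetical label)
    PARTIAL = "partial"  # only selected access paths are cut off


@dataclass(frozen=True)
class FenceDeviceInfo:
    # Hypothetical per-device metadata; not part of any real agent schema.
    name: str
    scope: IsolationScope
    isolates: frozenset  # kinds of shared resources cut off; unused for TOTAL


def uncovered_resources(devices, shared_in_use):
    """Return the shared resource kinds that no configured device isolates."""
    if any(d.scope is IsolationScope.TOTAL for d in devices):
        return set()  # a total cut-off covers everything
    covered = set()
    for d in devices:
        covered |= d.isolates
    return set(shared_in_use) - covered


# A partial (fabric-style) device that only cuts disk access leaves the
# network uncovered.
fabric = FenceDeviceInfo("fabric-example", IsolationScope.PARTIAL,
                         frozenset({"disk"}))
print(uncovered_resources([fabric], {"disk", "network"}))  # {'network'}
```

The point of the sketch is only that, once agents declare total versus
partial isolation, the remaining exposure can be computed and reported
rather than left implicit.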

/me gets back from daydreaming

-- 
Poki


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing

2019-07-25 Thread Andrei Borzenkov
On Thu, Jul 25, 2019 at 3:20 AM Ondrej wrote:
>
> Is there any plan on getting this also into 1.1 branch?
> If yes, then I would be for just introducing the configuration option in
> 1.1.x with default to 'stop'.
>

+1 for back porting it from someone who just recently hit this
(puzzling) behavior.


Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing

2019-07-25 Thread Roger Zhou


On 7/25/19 1:33 AM, Ken Gaillot wrote:
> Hi all,
> 
> A recent bugfix (clbz#5386) brings up a question.
> 
> A node may receive notification of its own fencing when fencing is
> misconfigured (for example, an APC switch with the wrong plug number)
> or when fabric fencing is used that doesn't cut the cluster network
> (for example, fence_scsi).
> 
> Previously, the *intended* behavior was for the node to attempt to
> reboot itself in that situation, falling back to stopping pacemaker if
> that failed. However, due to the bug, the reboot always failed, so the
> behavior effectively was to stop pacemaker.
> 
> Now that the bug is fixed, the node will indeed reboot in that
> situation.
> 
> It occurred to me that some users configure fabric fencing specifically
> so that nodes aren't ever intentionally rebooted. Therefore, I intend
> to make this behavior configurable.
> 
> My question is, what do you think the default should be?
> 
> 1. Default to the correct behavior (reboot)
> 
> 2. Default to the current behavior (stop)
> 
> 3. Default to the current behavior for now, and change it to the
> correct behavior whenever pacemaker 2.1 is released (probably a few
> years from now)
> 

Sounds like 3) is the best choice.

Make it configurable, and keep the current behavior (stop) for backward
compatibility within the current minor version, e.g. the next 2.0.z (3+).

That said, the correct behavior (reboot) should eventually be enforced
as the default. It should be treated as just as critical as a stop
failure of a resource. Making that change makes sense in the next minor
version, say, 2.1.

Thanks,
Roger






Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing

2019-07-24 Thread Ondrej

On 7/25/19 2:33 AM, Ken Gaillot wrote:

> Hi all,
> 
> A recent bugfix (clbz#5386) brings up a question.
> 
> A node may receive notification of its own fencing when fencing is
> misconfigured (for example, an APC switch with the wrong plug number)
> or when fabric fencing is used that doesn't cut the cluster network
> (for example, fence_scsi).
> 
> Previously, the *intended* behavior was for the node to attempt to
> reboot itself in that situation, falling back to stopping pacemaker if
> that failed. However, due to the bug, the reboot always failed, so the
> behavior effectively was to stop pacemaker.
> 
> Now that the bug is fixed, the node will indeed reboot in that
> situation.
> 
> It occurred to me that some users configure fabric fencing specifically
> so that nodes aren't ever intentionally rebooted. Therefore, I intend
> to make this behavior configurable.
> 
> My question is, what do you think the default should be?
> 
> 1. Default to the correct behavior (reboot)
> 
> 2. Default to the current behavior (stop)
> 
> 3. Default to the current behavior for now, and change it to the
> correct behavior whenever pacemaker 2.1 is released (probably a few
> years from now)



As long as there is an option to change it, I'm OK with changing the
default to 'reboot' starting with the next minor(?) version (2.0.3). But
it should be pointed out at the RC stage that this is going to happen,
so that users can get ready for it.


Is there any plan on getting this also into 1.1 branch?
If yes, then I would be for just introducing the configuration option in 
1.1.x with default to 'stop'.


--
Ondrej


[ClusterLabs] Feedback wanted: Node reaction to fabric fencing

2019-07-24 Thread Ken Gaillot
Hi all,

A recent bugfix (clbz#5386) brings up a question.

A node may receive notification of its own fencing when fencing is
misconfigured (for example, an APC switch with the wrong plug number)
or when fabric fencing is used that doesn't cut the cluster network
(for example, fence_scsi).

Previously, the *intended* behavior was for the node to attempt to
reboot itself in that situation, falling back to stopping pacemaker if
that failed. However, due to the bug, the reboot always failed, so the
behavior effectively was to stop pacemaker.

Now that the bug is fixed, the node will indeed reboot in that
situation.

It occurred to me that some users configure fabric fencing specifically
so that nodes aren't ever intentionally rebooted. Therefore, I intend
to make this behavior configurable.

My question is, what do you think the default should be?

1. Default to the correct behavior (reboot)

2. Default to the current behavior (stop)

3. Default to the current behavior for now, and change it to the
correct behavior whenever pacemaker 2.1 is released (probably a few
years from now)
-- 
Ken Gaillot 
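
The behavior described above (attempt a reboot, fall back to stopping
pacemaker, with the choice made configurable) can be sketched roughly as
follows. This is an illustration only, not Pacemaker's implementation:
FenceReaction, react_to_own_fencing and the two callbacks are
hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Callable


class FenceReaction(Enum):
    # Hypothetical names for the two behaviors discussed in the message.
    REBOOT = "reboot"  # try to reboot the node itself
    STOP = "stop"      # just stop the cluster software


def react_to_own_fencing(
    reaction: FenceReaction,
    try_reboot: Callable[[], bool],
    stop_cluster: Callable[[], None],
) -> str:
    """Return the action taken when a node is notified of its own fencing."""
    if reaction is FenceReaction.REBOOT and try_reboot():
        return "rebooted"
    # Either "stop" was configured or the reboot attempt failed:
    # fall back to stopping the cluster software.
    stop_cluster()
    return "stopped"
```

In this model the pre-bugfix behavior corresponds to try_reboot always
returning False, which is why the node effectively always ended up in
the "stopped" branch regardless of the intended reaction.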
