On Wed, Mar 2, 2011 at 9:05 AM, Stallmann, Andreas <astallm...@conet.de> wrote:
> Hi Andrew,
>
>>> If "suicide" is not a supported fencing option, why is it still included 
>>> with stonith?
>> Left over from heartbeat v1 days I guess.
>> Could also be a testing-only device like ssh.
>
> www.clusterlabs.org tells me you're the Pacemaker project leader.

Yes, but the stonith devices come from cluster-glue.
So I guess Dejan or Florian are nominally in charge of those, but
they've not been changed in forever.

> Would you, by chance, know who maintains or maintained the 
> suicide-stonith-plugin? It may be "testing-only", yes. But at least ssh is 
> working as intended.
>
>>> It's badly documented, and I didn't find a single (official) document
>>> on how to implement a (stable!) suicide-stonith,
>> Because you can't.  Suicide is not, will not, can not be reliable.
> Yes, you're right. But under certain circumstances (1. nodes are still alive, 
> 2. both redundant communication channels [networks] are down, 3. policy 
> requires that no node without quorum stays up) it might be a good addition 
> to a "regular" stonith (because if [2] happens, pacemaker/stonith will 
> probably not be able to control a network power switch etc.). Could we agree 
> on that?

Sure. But even if you have a functioning suicide plugin, Pacemaker
cannot ever make decisions that assume it worked, because for all it
knows the other side might consider itself to be perfectly healthy.

> If not: What's your recommended setup for (resp. against) such situations? 
> Think of "split sites" here!

You still need reliable fencing. If you can't provide that, there needs
to be a human in the loop.

>> The whole point of stonith is to create a known node state (off) in 
>> situations where you cannot be sure if your peer is alive, dead or some 
>> state in-between.
> Yes, so don't file "suicide" under "stonith"! We implemented a different 
> approach in a two-node cluster: We wrote a script, run from cron, that 
> pings the peer (if it answers, everything is fine) and then, if the peer is 
> not reachable, pings some quorum nodes. If either the peer or a majority of 
> the quorum nodes are alive, nothing happens. If "quorum" is lost, the node 
> shuts itself down.
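
For readers trying to picture the check described in the quoted paragraph,
here is a minimal sketch of its decision logic in Python. It is not the
poster's script, which was not shared. The function names, the host lists and
the Linux `ping -c 1 -W` flags are all assumptions made for illustration, and
"majority" is read as a strict majority of the quorum nodes.

```python
import subprocess


def is_reachable(host, timeout_s=2):
    """Return True if one ICMP echo to host succeeds.

    Assumption: a Linux-style ping that accepts -c (count) and -W (timeout).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_shut_down(peer, quorum_nodes, reachable=is_reachable):
    """Decide whether this node should shut itself down.

    Nothing happens if the peer answers, or if a strict majority of the
    (hypothetical) quorum nodes answers. Otherwise "quorum" is lost.
    """
    if reachable(peer):
        return False
    alive = sum(1 for node in quorum_nodes if reachable(node))
    # Shut down only when the reachable quorum nodes are NOT a majority.
    return 2 * alive <= len(quorum_nodes)
```

A cron job would call should_shut_down() with the real peer and quorum
addresses and trigger a clean shutdown when it returns True. Note that this
only covers the local decision: the surviving side still has no way to learn
that the shutdown actually happened.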

Wonderful, but the "healthy" side still can't do anything, because it
can't know that the "bad" side is down.
So what have you gained over no-quorum-policy=stop (which is the default)?

>
> We did that because drbd tended to misbehave in situations where all 
> network connectivity was lost. We'd rather have a clean shutdown on both 
> sides than a corrupt filesystem. I have always considered this solution 
> inelegant, mainly because it wasn't controllable via crm. Thus I hoped I 
> could forget this solution when using pacemaker. It seems I cannot.
>
> If there's any interest from the community in our "suicide by cron"-solution, 
> tell me if and how to contribute.
>
>> It requires a "sick" node to suddenly start functioning correctly - so 
>> attempting to self-terminate makes some sense, relying on it to succeed does 
>> not seem prudent.
>
> Yeeees! But it's not always the node that's sick. Sometimes (even with the 
> best and most redundant network), the connectivity between the nodes is the 
> problem, not a marauding pacemaker or openais! Again: Please tell me, what's 
> your solution in that case?

Again, tell me how the other side is supposed to know and what you gain?

>
>>> On the other hand, it doesn't make sense to name a "no-quorum-policy" 
>>> "suicide" if it's anything but a suicide (at most, one could call it 
>>> "assisted suicide").
>
> This question is still unanswered. Does "no-quorum-policy suicide" really 
> have a meaning?

yes, for N > 2, it is a faster version of "stop"
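
One way to read the N > 2 condition, sketched under the assumption of one
vote per node and a strict-majority quorum rule (the function name is
hypothetical, not a Pacemaker API): in a two-node cluster a split leaves each
side with exactly half the votes, so neither side has quorum and both would
apply the policy.

```python
def has_quorum(visible_votes, expected_votes):
    """Strict majority: more than half of the expected votes are visible.

    Assumption: every node carries exactly one vote.
    """
    return 2 * visible_votes > expected_votes


# Two nodes split 1/1: neither partition has quorum.
two_node_split = [has_quorum(1, 2), has_quorum(1, 2)]      # [False, False]

# Three nodes split 2/1: only the minority side loses quorum.
three_node_split = [has_quorum(2, 3), has_quorum(1, 3)]    # [True, False]
```

With three or more nodes, a single split can leave one partition with
quorum, so only the minority side acts on the no-quorum policy.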

> Or is it also a leftover from the "heartbeat" days?

no

> Is it still functional?

yes
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
