Re: [Linux-HA] Looking for a suitable Stonith Solution

Stallmann, Andreas Wed, 02 Mar 2011 00:05:57 -0800

Hi Andrew,

>> If "suicide" is not a supported fencing option, why is it still included with 
>> stonith?
> Left over from heartbeat v1 days I guess.
> Could also be a testing-only device like ssh.


According to www.clusterlabs.org, you're the Pacemaker project leader. Would you, 
by chance, know who maintains (or maintained) the suicide stonith plugin? It 
may be "testing-only", yes. But at least ssh works as intended.

>> It's badly documented, and I didn't find a single (official) document
>> on how to implement a (stable!) suicide stonith,
> Because you can't.  Suicide is not, will not, can not be reliable.
Yes, you're right. But under certain circumstances (1. the nodes are still alive, 
2. both redundant communication channels [networks] are down, 3. policy 
requires that no node without quorum stays up) it might be a good addition to 
a "regular" stonith, because if [2] happens, pacemaker/stonith will probably 
not be able to control a network power switch etc. Could we agree on that? If 
not: what's your recommended setup for (or rather, against) such situations? 
Think of "split sites" here!

> The whole point of stonith is to create a known node state (off) in 
> situations where you cannot be sure if your peer is alive, dead or some 
> state in-between.
Yes, so don't file "suicide" under "stonith"! We implemented a different 
approach in a two-node cluster: a script, run by cron, checks connectivity 
(via ping) to the peer. If the peer is reachable, everything is fine. If it 
is not, the script pings a set of quorum nodes. If either the peer or a 
majority of the quorum nodes is reachable, nothing happens. If "quorum" is 
lost, the node shuts itself down.
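For illustration, here is a minimal sketch of such a cron-driven check in Python. This is not our actual script; the host names (PEER, QUORUM_NODES) are hypothetical placeholders, and the ping flags assume Linux iputils ("-c" count, "-W" timeout in seconds). The decision logic is kept in a separate function so it can be tested without touching the network:

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical host names; replace with your own peer and quorum nodes.
PEER = "node2"
QUORUM_NODES = ["qnode1", "qnode2", "qnode3"]


def ping(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo to host succeeds (assumes iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_shut_down(peer: str, quorum_nodes: Sequence[str],
                     reachable: Callable[[str], bool]) -> bool:
    """Decide whether this node has lost "quorum" and must shut itself down."""
    # Peer reachable: everything is fine, nothing happens.
    if reachable(peer):
        return False
    # Peer unreachable: count how many quorum nodes still answer.
    alive = sum(1 for host in quorum_nodes if reachable(host))
    # Stay up only if a strict majority of quorum nodes is reachable.
    return alive <= len(quorum_nodes) // 2


def main() -> None:
    if should_shut_down(PEER, QUORUM_NODES, ping):
        # Clean shutdown; requires root privileges.
        subprocess.run(["shutdown", "-h", "now"], check=False)


if __name__ == "__main__":
    main()
```

It would be run from root's crontab, e.g. every minute (the interval and script path are up to you). A production version should probably require several consecutive failed checks before shutting down, so that a single lost ping doesn't take a healthy node offline.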

We did that because drbd tended to misbehave in situations where all network 
connectivity was lost. We'd rather have a clean shutdown on both sides than a 
corrupt filesystem. I always considered this solution inelegant, mainly 
because it wasn't controllable via crm. So I hoped I could drop this 
solution when moving to pacemaker. It seems I cannot.

If the community is interested in our "suicide by cron" solution, please tell 
me whether and how to contribute it.

> It requires a "sick" node to suddenly start functioning correctly - so 
> attempting to self-terminate makes some sense, relying on it to succeed does 
> not seem prudent.

Yeeees! But it's not always the node that's sick. Sometimes (even with the 
best and most redundant network), the connectivity between the nodes is the 
problem, not a marauding pacemaker or openais! Again: please tell me, what's 
your solution in that case?

>> On the other hand, it doesn't make sense to name a "no-quorum-policy" 
>> "suicide" if it's anything but a suicide (if anything, one could call it 
>> "assisted suicide").

This question is still unanswered. Does "no-quorum-policy=suicide" really have 
a meaning? Or is it also a leftover from the "heartbeat" days? Is it still 
functional?

Cheers,

Andreas

------------------------
CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke 
Höfer
Vorsitzender des Aufsichtsrates/Chairman of the Supervisory Board: Hans Jürgen 
Niemeier

CONET Technologies AG, Theodor-Heuss-Allee 19, 53773 Hennef.
Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 10328 )
Vorstand/Member of the Managementboard: Rüdiger Zeyen (Sprecher/Chairman), 
Wilfried Pütz
Vorsitzender des Aufsichtsrates/Chairman of the Supervisory Board: Dr. Gerd 
Jakob
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
