Re: [Linux-ha-dev] Recovering from "unexpected bad things" - is STONITH the answer?

Lars Marowsky-Bree Tue, 06 Nov 2007 12:37:11 -0800

On 2007-11-06T10:25:05, Alan Robertson <[EMAIL PROTECTED]> wrote:

> For problems that should "never" happen like death of one of our core/key 
> processes, is an immediate reboot of the machine the right recovery 
> technique?
>
> The advantages of such a choice include:
>  It is fast
>  It will invoke recovery paths that we exercise a lot in testing
>  It is MUCH simpler than trying to recover from all these cases,
>       therefore almost certainly more reliable


FailFast / self-fencing is certainly a good default. For selected
processes, we can always get fancier later.

I'd be happy with FailFast for the core processes, provided we get better
recovery for the network-facing processes, possibly including stonithd, at
least as long as it executes plugins within its own context.
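
To make the split concrete, here is a minimal sketch of a per-process
recovery policy. It is an illustration only, not how heartbeat is actually
configured: the process names other than stonithd are hypothetical, and
defaulting unknown processes to FailFast is an assumption in line with
treating it as the default.

```python
from enum import Enum


class Recovery(Enum):
    FAIL_FAST = "fail_fast"  # self-fence: reboot, or stop heartbeating
    RESTART = "restart"      # respawn only the failed process


# Hypothetical process names, for illustration only (stonithd is the one
# name taken from the discussion).
POLICY = {
    "core-membership": Recovery.FAIL_FAST,
    "core-resource-manager": Recovery.FAIL_FAST,
    "network-listener": Recovery.RESTART,
    "stonithd": Recovery.RESTART,
}


def recovery_for(process_name):
    """Return the recovery action for a process that died unexpectedly.

    Anything not listed falls back to FailFast, the conservative default.
    """
    return POLICY.get(process_name, Recovery.FAIL_FAST)
```

With this table, recovery_for("stonithd") selects a restart, while any
process we did not think about ends up on the well-tested reboot path.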

An alternative is to immediately restart all of the cluster processes
locally, but that can cause fluctuations as well.

My suggestion would be to combine this with the watchdog system to
trigger a reboot, or to simply stop heartbeating and rely on the other
nodes to shoot us.
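
A rough sketch of the watchdog variant, under stated assumptions: the class
and function names are hypothetical, the liveness checks are supplied by the
caller, and the device helper relies only on the Linux watchdog convention
that any write to the device resets its timer. The idea is that we keep
petting the watchdog only while every core process is alive; once one has
died we stop for good and let the hardware reset the node.

```python
class WatchdogGuard:
    """Pet the watchdog only while every core process is alive."""

    def __init__(self, alive_checks, pet):
        self.alive_checks = alive_checks  # name -> callable returning bool
        self.pet = pet                    # callable that resets the watchdog timer
        self.failed = None                # name of the first process seen dead

    def tick(self):
        """Run one check cycle; return True if the watchdog was petted."""
        if self.failed is None:
            for name, alive in self.alive_checks.items():
                if not alive():
                    self.failed = name    # latch the failure; never un-fail
                    break
        if self.failed is None:
            self.pet()
            return True
        return False  # deliberately not petting: the watchdog will expire


def pet_linux_watchdog(dev):
    """Reset the timer on an already opened watchdog device.

    Assumes dev was opened in binary write mode, for example
    open("/dev/watchdog", "wb", buffering=0).
    """
    dev.write(b"\0")
```

The failure is latched on purpose: a core process that comes back on its own
does not make the node trustworthy again. The same loop could stop sending
heartbeats instead of stopping the watchdog, in which case the other nodes
do the shooting.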

> The disadvantages of such a choice include:
>  It is crude, and very annoying

It's not very annoying; it means that the machine is beyond repair,
anyway.

>  It probably shouldn't be invoked for single-node clusters (?)

It's similar to killing init, which will also reboot the machine. No big
deal.


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
