On 2007-11-06T10:25:05, Alan Robertson <[EMAIL PROTECTED]> wrote:

> For problems that should "never" happen like death of one of our core/key
> processes, is an immediate reboot of the machine the right recovery
> technique?
>
> The advantages of such a choice include:
>     It is fast
>     It will invoke recovery paths that we exercise a lot in testing
>     It is MUCH simpler than trying to recover from all these cases,
>         therefore almost certainly more reliable

FailFast / self-fencing is certainly a good default. We can, for
selective processes, always get more fancy.

I'd be happy with FailFast for the core processes, if we get better
recovery for the network-facing processes, possibly stonithd - at least
as long as it executes plugins within its own context.

An alternative is an immediate restart of the whole cluster processes
locally, but that can cause fluctuations as well.

My suggestion would be to combine this with the watchdog system to
trigger a reboot, or to simply stop heartbeating and rely on the other
nodes to shoot us.

> The disadvantages of such a choice include:
>     It is crude, and very annoying

It's not very annoying; it means that the machine is beyond repair,
anyway.

>     It probably shouldn't be invoked for single-node clusters (?)

It's similar to killing init, which will also reboot the machine. No big
deal.


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
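
To make the policy above concrete, here is a minimal simulated sketch, not Heartbeat code. It assumes two classes of process: "core" ones, whose death makes the supervisor stop feeding a watchdog so the node gets reset, and everything else, which is simply restarted. The names `ProcessSpec`, `Watchdog`, `supervise`, `core_daemon` and `net_daemon` are all hypothetical, and the watchdog is a plain counter rather than a real device.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RESTART_PROCESS = "restart the process locally"
    FAIL_FAST = "stop feeding the watchdog and let the node be reset"


@dataclass(frozen=True)
class ProcessSpec:
    name: str   # hypothetical process name, for illustration only
    core: bool  # True for processes whose death should "never" happen


def recovery_action(spec: ProcessSpec) -> Action:
    """FailFast for core processes, a plain restart for the rest."""
    return Action.FAIL_FAST if spec.core else Action.RESTART_PROCESS


class Watchdog:
    """Simulated watchdog timer; a stand-in for a real watchdog device."""

    def __init__(self, timeout_ticks: int):
        self.timeout_ticks = timeout_ticks
        self.ticks_since_feed = 0

    def feed(self) -> None:
        self.ticks_since_feed = 0

    def tick(self) -> bool:
        """Advance one tick; return True once the timer has expired."""
        self.ticks_since_feed += 1
        return self.ticks_since_feed >= self.timeout_ticks


def supervise(specs, dead_at, watchdog, max_ticks):
    """Simulate supervision for up to max_ticks ticks.

    dead_at maps a process name to the tick at which it dies.
    Returns (log, reset_tick); reset_tick is None if no reset happened.
    """
    by_name = {s.name: s for s in specs}
    log = []
    failed_fast = False
    for tick in range(max_ticks):
        for name, when in dead_at.items():
            if when == tick:
                action = recovery_action(by_name[name])
                log.append((tick, name, action))
                if action is Action.FAIL_FAST:
                    failed_fast = True
        # After a core process dies we never feed the watchdog again.
        if not failed_fast:
            watchdog.feed()
        if watchdog.tick():
            return log, tick
    return log, None


if __name__ == "__main__":
    specs = [ProcessSpec("core_daemon", True), ProcessSpec("net_daemon", False)]
    log, reset_tick = supervise(
        specs, {"net_daemon": 1, "core_daemon": 3}, Watchdog(3), 10
    )
    for tick, name, action in log:
        print(f"tick {tick}: {name} died -> {action.value}")
    print("node reset at tick", reset_tick)
```

The point of routing the reboot through the watchdog is that the supervisor does not need any working recovery code of its own: doing nothing is enough. The same shape would apply to "stop heartbeating and let the other nodes shoot us", with the peers' STONITH action taking the place of the timer.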