On Dec 19, 2007, at 2:13 PM, Lars Marowsky-Bree wrote:

On 2007-12-19T11:32:12, Andrew Beekhof <[EMAIL PROTECTED]> wrote:

i prefer to use the "crm respawn" directive which disables the fast-fail
logic^.
when a non-transient problem like this occurs and heartbeat is started at boot time (which is the normal thing to do), you have about 2s to identify
and fix the problem before the node reboots again.

personally, i find this timeframe unrealistic

This is not about "identifying" the problem, but about quickly resolving
transient errors.

Don't lecture me, I know what fast-fail is about.
What I am clearly talking about is how it makes the cluster behave in situations fast-fail wasn't designed for and isn't appropriate for.
You can't just make the issue go away with "well ..."


I have yet to hear a single concrete example of where fast-fail for Heartbeat child processes adds any value to a cluster that already has stonith^.

I'm not saying it shouldn't exist, but if it's not adding value and comes with side-effects (pathological behavior in non-transient situations, and nodes always being shot as they come up again^^)... why on earth do we have it enabled unconditionally?


^ Stonith already ensures the node is dead before the resources it had are started elsewhere. We're not talking about rogue resources trashing data - that is in no way covered by this implementation of fast-fail.

^^ Bugzilla #1810.


If, as in this case, the problem isn't transient, well ...


Fast-fail is the right approach. I'd argue that the saner default might
be to use fast-fail to cause a "crash" (including a crashdump for
debugging) instead of entering a reboot loop, yes.

(Combined with STONITH, the other nodes still might decide to reboot the
node; possibly allowing enough time for it to actually dump would be
saner still.)

Fast-fail clearly is the right direction to take, though.



Regards,
   Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
