On Dec 19, 2007, at 2:13 PM, Lars Marowsky-Bree wrote:
On 2007-12-19T11:32:12, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
i prefer to use the "crm respawn" directive which disables the
fast-fail logic^.
when a non-transient problem like this occurs and heartbeat is started
at boot time (which is the normal thing to do), you have about 2s to
identify and fix the problem before the node reboots again
personally, i find this timeframe unrealistic
This is not about "identifying" the problem, but about quickly
resolving transient errors.
Don't lecture me, I know what fast-fail is about.
What I am clearly talking about is how it makes the cluster behave in
situations fast-fail wasn't designed for and isn't appropriate for.
You can't just make the issue go away with "well ..."
I have yet to hear a single concrete example of where fast-fail
for Heartbeat child processes adds any value to a cluster that already
has stonith^.
I'm not saying it shouldn't exist, but if it's not adding value and
comes with side-effects (pathological behavior in non-transient
situations and nodes always being shot as they come up again^^)... why
on earth do we have it enabled unconditionally?
^ Stonith already ensures the node is dead before starting the
resources it had.
We're not talking about rogue resources trashing data - that is in no
way covered by this implementation of fast-fail.
^^ Bugzilla #1810.
If, as in this case, the problem isn't transient, well ...
Fast-fail is the right approach. I'd argue that the saner default
might be to use fast-fail to cause a "crash" (including a crashdump
for debugging) instead of entering a reboot loop, yes.
(Combined with STONITH, the other nodes still might decide to reboot
the node; possibly allowing enough time for it to actually dump would
be saner still.)
Fast-fail clearly is the right direction to take, though.
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar
Wilde
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems