On 2013-10-02T09:36:14, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:

> In general I'm afraid you cannot handle this situation in a perfect way:
> 
> You have two types of problems:
> 1) A node, resource, or monitor is hanging, but a long timeout prevents
> the cluster from recognizing this in time
> 2) A node, resource, or monitor is performing slower than usual, but a short
> timeout causes the cluster to think there is a problem with the
> node/resource/monitor

Yes, or to summarize, timeouts suck for failure detection, but for many
cases, we don't have anything better. Digging out my age-old post:
http://advogato.org/person/lmb/diary/108.html

A massively overloaded system is indistinguishable from a failing or
hung one. On the plus side, if a system is *so* overloaded that
corosync isn't being scheduled and its rather limited network traffic
presents a problem, it is likely so FUBAR'ed that fencing it doesn't
make things worse. So the misdiagnosis isn't necessarily a problem.
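
To make the trade-off concrete, here is a minimal Python sketch of what a
purely timeout-based monitor can and cannot tell apart. It is only an
illustration: classify(), detection_delay() and the numbers are made up
for this example and are not taken from corosync or Pacemaker.

```python
from typing import Optional


def classify(response_time: Optional[float], timeout: float) -> str:
    """Verdict a timeout-only monitor reaches for one operation.

    response_time is how long the operation really takes, in seconds.
    None models a genuine hang: the operation never returns.
    """
    if response_time is None or response_time > timeout:
        # The monitor only sees "no answer within the timeout".
        # It cannot know whether an answer would have arrived later.
        return "failed"
    return "ok"


def detection_delay(timeout: float) -> float:
    """A real hang is noticed only once the timeout has expired."""
    return timeout


# Hypothetical workloads: an overloaded but working node, and a hung one.
slow_but_alive = 45.0
hung = None

for timeout in (20.0, 120.0):
    print(
        f"timeout={timeout:>5.0f}s  "
        f"slow node: {classify(slow_but_alive, timeout):<6}  "
        f"hung node: {classify(hung, timeout):<6}  "
        f"hang noticed after {detection_delay(timeout):.0f}s"
    )
```

With the 20s timeout the slow-but-alive node is declared failed (problem 2
above); with the 120s timeout it passes, but a real hang goes unnoticed
for two minutes (problem 1). No single value fixes both.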

> BTW: We had experienced hanging I/O when one of our SAN devices had a
> problem, but the others did not. Still the SLES11 SP2 kernel saw
> stalled I/Os for obviously unaffected devices. The problem is being
> investigated...

FC can be weird like that if it is routed through the same HBA or
switch. It's not always a kernel problem; the fabric isn't trivial
either. Good luck with finding the root cause :-/


Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
