On 2013-10-02T09:36:14, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
> In general I'm afraid you cannot handle this situation in a perfect way:
>
> You have two types of problems:
> 1) A node, resource, or monitor is hanging, but a long timeout prevents to
> recognize this in time
> 2) A node, resource, or monitor is performing slower than usual, but a short
> timeout causes the cluster to think there is a problem with the
> node/resource/monitor

Yes, or to summarize, timeouts suck for failure detection, but for many
cases, we don't have anything better.

Digging out my age-old post: http://advogato.org/person/lmb/diary/108.html

A massively overloaded system is indistinguishable from a failing or
hung one.

On the plus side, if a system is *so* overloaded that corosync isn't
being scheduled and its rather limited network traffic presents a
problem, it is likely so FUBAR'ed that fencing it doesn't make things
worse. So the misdiagnosis isn't necessarily a problem.

> BTW: We had experienced hanging I/O when one of our SAN devices had a
> problem, but the others did not. Still the SLES11 SP2 kernel saw
> stalled I/Os for obviously unaffected devices. The problem is being
> investigated...

FC can be weird like that if it is routed through the same HBA or
switch. It's not always a kernel problem; the fabric isn't trivial
either. Good luck with finding the root cause :-/


Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde