Hi! In general I'm afraid you cannot handle this situation in a perfect way:
You have two types of problems:

1) A node, resource, or monitor is hanging, but a long timeout prevents this from being recognized in time.
2) A node, resource, or monitor is performing slower than usual, but a short timeout causes the cluster to think there is a problem with the node/resource/monitor.

So your timeout has to be set where both 1) and 2) occur with minimal probability. Obviously, if runtimes vary, the solution cannot be optimal.

BTW: We experienced hanging I/O when one of our SAN devices had a problem, but the others did not. Still, the SLES11 SP2 kernel saw stalled I/Os for devices that were obviously unaffected. The problem is being investigated...

Regards,
Ulrich

>>> Moullé Alain <alain.mou...@bull.net> wrote on 01.10.2013 at 17:01 in message <524ae3e0.8060...@bull.net>:
> Hi,
>
> with the Pacemaker/corosync stack:
>
> suppose that a node in an HA cluster is so loaded (I/Os, etc.) for longer
> than the heartbeat timeout value, but only temporarily, that it can no
> longer even handle heartbeat tokens, and it is fenced because it cannot
> handle heartbeat tokens, whereas there is no real problem, just a node
> that is temporarily overloaded.
>
> How do you/could we manage this type of problem?
> Is there a way to always give corosync traffic higher priority than any
> other load?
>
> Thanks
> Alain
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems