Hi!

In general I'm afraid you cannot handle this situation in a perfect way:

You have two types of problems:
1) A node, resource, or monitor is hanging, but a long timeout prevents this
from being detected in time
2) A node, resource, or monitor is running slower than usual, but a short
timeout makes the cluster think there is a problem with the
node/resource/monitor

So your timeout has to be set where both 1) and 2) occur with minimal
probability. Obviously, if runtimes vary, no single value can be optimal.
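
To illustrate the trade-off, here is a minimal sketch in Python. It assumes you
have logged the runtimes of a healthy monitor. The sample values and the helper
names false_timeouts and pick_timeout are made up for this example; they are
not part of Pacemaker or corosync.

```python
def false_timeouts(runtimes, timeout):
    """Count healthy runs that would exceed the timeout (problem 2)."""
    return sum(1 for r in runtimes if r > timeout)


def pick_timeout(runtimes, max_false):
    """Return the smallest observed runtime that, used as the timeout,
    misjudges at most max_false of the healthy runs."""
    for candidate in sorted(runtimes):
        if false_timeouts(runtimes, candidate) <= max_false:
            return candidate
    return max(runtimes)


# Hypothetical runtimes (seconds) of a healthy monitor, with two load spikes.
samples = [1.0, 1.2, 0.9, 1.1, 1.0, 4.5, 1.3, 1.0, 9.0, 1.1]

for allowed in (2, 1, 0):
    t = pick_timeout(samples, allowed)
    # The timeout is also the worst-case delay before a real hang
    # is noticed (problem 1).
    print(f"tolerating {allowed} false timeouts: timeout {t:.1f}s")
```

With these made-up samples, tolerating no false timeouts pushes the timeout up
to the slowest healthy run, so a real hang is noticed much later. Accepting a
few false timeouts brings it back down close to the typical runtime.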

BTW: We experienced hanging I/O when one of our SAN devices had a problem,
but the others did not. Still, the SLES11 SP2 kernel saw stalled I/Os for
devices that were obviously unaffected. The problem is being investigated...

Regards,
Ulrich


>>> Moullé Alain<alain.mou...@bull.net> wrote on 01.10.2013 at 17:01 in
message
<524ae3e0.8060...@bull.net>:
> Hi,
> 
> with the Pacemaker/corosync stack:
> 
> suppose that a node in an HA cluster is temporarily so loaded (I/O, etc.),
> for longer than the heartbeat timeout value, that it can no longer handle
> heartbeat tokens, and it is fenced for that reason, whereas there is no
> real problem, just a temporarily overloaded node.
> 
> How do you/could we manage this type of problem?
> Is there a way to always give corosync traffic higher priority than
> any other load?
> 
> Thanks
> Alain
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org 
> http://lists.linux-ha.org/mailman/listinfo/linux-ha 
> See also: http://linux-ha.org/ReportingProblems 


