Hi Steven,
Do you have a formula to calculate the timeout from the token,
token_retransmits_before_loss_const, and consensus values?
And is there any risk to corosync's behavior, stability, etc. if we
increase this time to around 45s / 60s?
Has anybody tried this?
Thanks
Regards
Alain
Hi Steven,
I've given it a try:
setting token=45000 and token_retransmits_before_loss_const=45 also
requires setting consensus=54000 (at least 1.2 * token), otherwise corosync
fails to start. With these values, when I do ifdown eth0 on one node, it
actually takes around 98s for this node to appear OFFLINE in crm_mon on
the healthy node, so I don't know exactly what the formula is.
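As a rough sanity check on these numbers, here is a small Python sketch.
The 1.2 * token lower bound on consensus comes from the start failure
described above. The idea that the detection time is roughly
token + consensus is only an assumption (nobody in this thread has
confirmed it), although 45s + 54s = 99s is close to the 98s measured.
The function names are made up for illustration and are not part of
corosync.

```python
# Values reported in this message, in milliseconds.
TOKEN_MS = 45000
CONSENSUS_MS = 54000
OBSERVED_OFFLINE_S = 98  # measured with crm_mon after "ifdown eth0"


def min_consensus_ms(token_ms):
    """Smallest consensus value accepted, per the 1.2 * token rule above."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return (token_ms * 12) // 10


def estimated_detection_s(token_ms, consensus_ms):
    """ASSUMPTION: the token timeout expires, then one consensus round runs."""
    return (token_ms + consensus_ms) / 1000


if __name__ == "__main__":
    assert CONSENSUS_MS >= min_consensus_ms(TOKEN_MS)
    estimate = estimated_detection_s(TOKEN_MS, CONSENSUS_MS)
    print("minimum consensus (ms):", min_consensus_ms(TOKEN_MS))
    print("estimated detection (s):", estimate)
    print("observed (s):", OBSERVED_OFFLINE_S)
```

If the assumption holds, a 45s target for the whole detection time would
need a token value well below 45000, since consensus adds at least
another 1.2 * token on top.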
Thanks
Regards
Alain
token: 45000
token_retransmits_before_loss_const: 45
On Wed, 2010-05-19 at 08:39 +0200, Alain.Moulle wrote:
Hi Steven,
In fact, I first posted this question on the Pacemaker ML,
but there is no way in Pacemaker to increase this time, which
I think is normal, as the "cluster manager" part is provided
by corosync, which manages the heartbeat. My concern is to increase
this time substantially, even up to values such as 45s. That is not
a problem for the applications I have to manage, but 10s is really
a big problem for me in case of a network problem that leads to
silence on the heartbeat for a while. So, based on your experience,
which parameters do you think I can try to increase to get this
45s timeout?
Thanks a lot.
Regards
Alain
On Mon, 2010-05-17 at 08:25 +0200, Alain.Moulle wrote:
Hi again,
I've checked man corosync.conf and seen many parameters
around token timers etc., but I can't see how to increase
the heartbeat timeout. When testing, the timeout turns out to be
between 10s and 12s before a node decides to fence another one in
the cluster (for example, when I force an ifdown eth0 on this node
to simulate a heartbeat failure).
But I can't see which parameter(s) to tune in corosync.conf
to increase these 10 or 12s ...
Any tip would be appreciated...
Thanks
Alain
Alain,
I don't have a direct answer to your question. Corosync detects a
failure of any node in "token" msec. I have not measured how long
qpid/fencing/pacemaker/rgmanager/gfs/ocfs/etc take to operate on this
notification. This delta between failure detection and recovery would
be a good question to potentially ask on the pacemaker ml.
In my test environments I run at token = 1000 msec. Totem can be tuned
to lower values, but under a heavy network load it may falsely detect a
node failure.
Most products that use Corosync ship with a 10000 msec (10 sec) or
larger token value to offer the least chance of false node failure
detection.
The token timer is just one consideration, however. The
"token_retransmits_before_loss_const" defaults to 4. This may be too
low in lossy or heavily loaded networks. A higher value for this
setting produces a bit more load but more resilient behavior.
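To give a feel for how the two settings interact, here is a small Python
sketch of the token retransmit interval. The formula is an assumption
recalled from corosync's totem configuration code and should be checked
against the version you run; the function name is made up for
illustration. With the defaults mentioned here (token = 1000, const = 4)
it gives 238 msec, which, as far as I recall, is the token_retransmit
default listed in corosync.conf(5).

```python
def token_retransmit_ms(token_ms, retransmits_before_loss=4):
    """Approximate interval between token retransmits, in milliseconds.

    ASSUMPTION: token / (token_retransmits_before_loss_const + 0.2),
    truncated to an integer. Verify against your corosync version.
    """
    return int(token_ms / (retransmits_before_loss + 0.2))


if __name__ == "__main__":
    # (token, token_retransmits_before_loss_const) pairs from this thread.
    for token_ms, const in [(1000, 4), (10000, 4), (45000, 45)]:
        interval = token_retransmit_ms(token_ms, const)
        print(f"token={token_ms} const={const} -> retransmit every {interval} ms")
```

Under this assumption, raising the constant together with token keeps the
retransmit interval short, which is where the extra load comes from.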
Regards
-steve
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais