Hi Steven,

Do you have a formula for calculating the timeout from the token,
token_retransmits_before_loss_const, and consensus values?

Is there any risk to corosync's behavior, stability, etc. if we increase this
time to around 45s / 60s?

Does anybody have experience with this?

Thanks
Regards
Alain
Hi Steven,
I've given it a try:
setting token=45000 and token_retransmits_before_loss_const=45 also requires
setting consensus=54000 (at least 1.2 * token), otherwise corosync fails to
start. With these values, when I do ifdown eth0 on one node, it actually takes
around 98s for this node to appear OFFLINE in crm_mon on the healthy node, so I
don't know exactly what the formula is.

Thanks
Regards
Alain

    token: 45000
    token_retransmits_before_loss_const: 45
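
For reference, the consensus constraint mentioned above can be checked with a few lines of Python. This is a minimal sketch, not corosync code: the helper names are hypothetical, the 1.2 factor is the one quoted in the message, and the idea that the detection delay is roughly token + consensus is only an assumption that happens to sit close to the ~98s observed, not a formula confirmed in this thread.

```python
# Sketch only: all values are in milliseconds, helper names are hypothetical.

def min_consensus_ms(token_ms: int) -> int:
    """Smallest consensus value allowed for a given token (1.2 * token)."""
    # 1.2 == 6/5; integer math with round-up avoids floating-point surprises.
    return (token_ms * 6 + 4) // 5


def consensus_is_valid(token_ms: int, consensus_ms: int) -> bool:
    """True if consensus satisfies the 'at least 1.2 * token' rule."""
    return consensus_ms >= min_consensus_ms(token_ms)


def estimated_detection_ms(token_ms: int, consensus_ms: int) -> int:
    """ASSUMPTION: delay before a node is seen as lost ~ token + consensus."""
    return token_ms + consensus_ms


token = 45000
consensus = min_consensus_ms(token)
print(consensus)                                        # 54000
print(consensus_is_valid(token, consensus))             # True
print(estimated_detection_ms(token, consensus) / 1000)  # 99.0 (seconds)
```

Under that assumption, a 45s delay would need a token well below 45000, since consensus adds at least another 1.2 * token on top.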

     On Wed, 2010-05-19 at 08:39 +0200, Alain.Moulle wrote:
        Hi Steven,
        in fact, I first posted this question on the Pacemaker ML,
        but there is no way in Pacemaker to increase this time, and
        I think that is normal, as the "cluster manager" part is provided
        by corosync, which manages the heartbeat. My concern is to increase
        this time substantially, even up to values such as 45s. That is not
        a problem for the applications I have to manage, but 10s is really
        a big problem for me in case of a network problem that leads to
        silence on the heartbeat for a while. So, based on your experience,
        which parameters do you think I can try to increase to get this
        45s timeout?

        Thanks a lot.
        Regards
        Alain
            On Mon, 2010-05-17 at 08:25 +0200, Alain.Moulle wrote:
                    Hi again,

                    I've checked man corosync.conf and seen many parameters
                    around token timers etc., but I can't see how to increase
                    the heartbeat timeout. In testing, the timeout is between
                    10s and 12s before a node decides to fence another one in
                    the cluster (for example, when I force an ifdown eth0 on
                    this node to simulate a heartbeat failure). But I can't
                    see which parameter(s) to tune in corosync.conf to
                    increase these 10 or 12s ...

                    Any tip would be appreciated...
                    Thanks
                    Alain
            Alain,

            I don't have a direct answer to your question.  Corosync detects a
            failure of any node in "token" msec.  I have not measured how long
            qpid/fencing/pacemaker/rgmanager/gfs/ocfs/etc take to operate on
            this notification.  This delta between failure detection and
            recovery would be a good question to potentially ask on the
            pacemaker ml.

            In my test environments I run at token = 1000 msec.  Totem can be
            tuned to lower values, but under a heavy network load, may falsely
            detect a node failure.

            Most products that use Corosync ship with a 10000msec (10sec) or
            larger token value to offer least chance of false node detection.

            The token timer is just one consideration, however.  The
            "token_retransmits_before_loss_const" defaults to 4.  This may be
            too low in lossy or heavy load networks.  A higher value for this
            configuration produces a bit more load but more resilient behavior.

            Regards
            -steve
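
To illustrate why a higher token_retransmits_before_loss_const produces a bit more load, here is a small Python sketch. It assumes the retransmit interval is derived as token / (token_retransmits_before_loss_const + 0.2); that derivation is not stated in this thread and is an assumption based on the documented default token_retransmit of 238 ms for token = 1000 and a constant of 4. The function name is hypothetical.

```python
# Sketch only: values in milliseconds; the formula below is an ASSUMPTION
# about how the retransmit interval is derived, not something stated here.

def token_retransmit_ms(token_ms: int, retransmits_before_loss: int) -> int:
    """Assumed interval between token retransmits, truncated to whole ms."""
    return int(token_ms / (retransmits_before_loss + 0.2))


# (token, token_retransmits_before_loss_const) pairs discussed in the thread.
for token_ms, const in [(1000, 4), (10000, 4), (45000, 45)]:
    print(token_ms, const, token_retransmit_ms(token_ms, const))
# 1000 4 238
# 10000 4 2380
# 45000 45 995
```

Under this assumption, raising the constant for a fixed token shortens the interval between retransmits, which would account for the extra load while giving a lost token more chances to get through.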


        _______________________________________________
        Openais mailing list
        Openais@lists.linux-foundation.org
        https://lists.linux-foundation.org/mailman/listinfo/openais
