I have about a dozen corosync+pacemaker clusters and I am just now getting 
around to understanding timeouts.

Most of my corosync.conf files look something like this:

        version:        2
        token:          5000
        token_retransmits_before_loss_const: 10
        join:           1000
        consensus:      7500
        vsftype:        none
        max_messages:   20
        secauth:        off
        threads:        0
        clear_node_high_bit: yes
        rrp_mode:       active

If I understand this correctly, this means the node will wait 50 seconds 
(5000ms x 10) before deciding that a cluster reconfig is necessary (perhaps 
after a link failure). Is that correct?
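
To make the arithmetic behind that question concrete, here is a small Python 
sketch of two possible readings. Reading 1 is the multiplication described 
above. Reading 2 is an assumption on my part, based on my reading of 
corosync.conf(5): that token is the whole loss-detection window and that the 
retransmit interval is derived from it as token / (retransmits + 0.2). I have 
not verified this against every corosync version, so treat it as a sketch, not 
an answer.

```python
# Values copied from the corosync.conf fragment above.
token_ms = 5000
retransmits_before_loss = 10

# Reading 1 (the one asked about): every retransmit waits a full token timeout.
multiplied_ms = token_ms * retransmits_before_loss

# Reading 2 (ASSUMPTION, from my reading of corosync.conf(5)): token is the
# total window, and the retransmits are spread evenly inside that window.
retransmit_interval_ms = token_ms / (retransmits_before_loss + 0.2)

print(multiplied_ms)                  # 50000
print(round(retransmit_interval_ms))  # 490
```

If reading 2 is the right one, the node would give up on the token after about 
5 seconds, not 50.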

I'm trying to understand how this works together with the arp_interval 
setting on my bonded NICs. I normally set arp_interval=1000. My question is, 
how many ARP losses are required before the bonding driver decides to fail 
over to the other link? If arp_interval=1000, how many times does the driver 
send an ARP and fail to receive a reply before it decides that the link is dead?

I think I need to know this so I can set my corosync.conf settings correctly to 
avoid "false positive" cluster failovers. In other words, if there is a link or 
switch failure, I want to make sure that the cluster allows plenty of time for 
link communication to recover before deciding that a node has actually died. 
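
The comparison I have in mind can be sketched in Python as follows. The 
missed_probes parameter is hypothetical: it is exactly the number I am asking 
about, so the loop only tries a few candidate values. The function names are 
mine, and the sketch also assumes that the token value (5000 ms here) is the 
window corosync allows before declaring the token lost.

```python
def link_failover_ms(arp_interval_ms, missed_probes):
    """Time for bonding to give up on a link, if it takes `missed_probes`
    consecutive unanswered ARP probes (hypothetical parameter)."""
    return arp_interval_ms * missed_probes


def headroom_ms(token_ms, arp_interval_ms, missed_probes):
    """Margin between bonding failover and the corosync token timeout.
    A value at or below zero means the cluster may react first."""
    return token_ms - link_failover_ms(arp_interval_ms, missed_probes)


# Try a few candidate values for the unknown number of missed probes.
for n in (1, 2, 3, 5):
    print(n, headroom_ms(5000, 1000, n))
```

Whatever the real number of missed probes turns out to be, the goal is to keep 
that margin comfortably positive.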

--
Eric Robinson


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
