On 10/10/2016 06:58 PM, Eric Robinson wrote:
> Thanks for the clarification. So what's the easiest way to ensure that the
> cluster waits a desired timeout before deciding that a re-convergence is
> necessary?

By raising the token (lost) timeout, I would say. Please correct me
(Chrissie), but I see the token (lost) timeout as resilience against static
delays plus jitter on top, and token_retransmits_before_loss_const as
resilience against packet loss.

>
> --
> Eric Robinson
>
>
> -----Original Message-----
> From: Christine Caulfield [mailto:ccaul...@redhat.com]
> Sent: Monday, October 10, 2016 4:34 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Establishing Timeouts
>
> On 10/10/16 05:51, Eric Robinson wrote:
>> I have about a dozen corosync+pacemaker clusters and I am just now getting
>> around to understanding timeouts.
>>
>> Most of my corosync.conf files look something like this:
>>
>> version: 2
>> token: 5000
>> token_retransmits_before_loss_const: 10
>> join: 1000
>> consensus: 7500
>> vsftype: none
>> max_messages: 20
>> secauth: off
>> threads: 0
>> clear_node_high_bit: yes
>> rrp_mode: active
>>
>> If I understand this correctly, this means the node will wait 50 seconds
>> (5000ms x 10) before deciding that a cluster reconfig is necessary (perhaps
>> after a link failure). Is that correct?
>>
> No, that's not correct. The token timeout is 5 seconds in your example,
> because token is 5000 ms. The token timeout is always whatever the value of
> totem.token is.
>
> token_retransmits_before_loss_const affects the token hold timeout, which is
> how long the token is held on a node that has no messages to send before
> being forwarded on. So increasing token_retransmits_before_loss_const changes
> the number of times per 'token' timeout that the token is actually sent.
>
> In the example above you will see that the token is sent approximately
> every 5000/10 = 500 ms. That's approximate: the value is scaled slightly to
> make actual timeouts less likely, and is also affected by messages that may
> need to be sent.
>
> Chrissie
>
>> I'm trying to understand how this works together with my bonded NIC's
>> arp_interval settings.
>> I normally set arp_interval=1000. My question is, how many arp losses are
>> required before the bonding driver decides to failover to the other link?
>> If arp_interval=1000, how many times does the driver send an arp and fail
>> to receive a reply before it decides that the link is dead?
>>
>> I think I need to know this so I can set my corosync.conf settings correctly
>> to avoid "false positive" cluster failovers. In other words, if there is a
>> link or switch failure, I want to make sure that the cluster allows plenty
>> of time for link communication to recover before deciding that a node has
>> actually died.
>>
>> --
>> Eric Robinson
>>
>>
>> _______________________________________________
>> Users mailing list: Users@clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
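
To put numbers on Chrissie's explanation quoted above, here is a minimal Python sketch of the arithmetic. The function name token_timing and its field names are made up for illustration and are not part of corosync. The send interval is only the rough token / token_retransmits_before_loss_const figure from this thread; corosync scales the real value slightly, so treat it as an estimate.

```python
def token_timing(token_ms, retransmits_before_loss):
    """Rough timing figures for a totem config, per the explanation in this thread.

    Hypothetical helper for illustration only, not a corosync API.
    """
    if token_ms <= 0 or retransmits_before_loss <= 0:
        raise ValueError("both values must be positive")
    return {
        # The token (loss) timeout is simply totem.token. It is NOT
        # multiplied by token_retransmits_before_loss_const.
        "token_timeout_ms": token_ms,
        # Approximate interval between token sends. corosync scales the
        # real value slightly, so this is an estimate only.
        "approx_send_interval_ms": token_ms / retransmits_before_loss,
    }


# Values taken from the corosync.conf quoted in this thread.
timing = token_timing(5000, 10)
print(timing["token_timeout_ms"])         # 5000
print(timing["approx_send_interval_ms"])  # 500.0
```

With the values from the quoted corosync.conf this gives a 5 second token timeout, not 50 seconds. Raising token is what lengthens the wait before a re-convergence; raising token_retransmits_before_loss_const only means the token is sent more often within that same window.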