Re: [ClusterLabs] Corosync 3.1.0 token timeout

Jan Friesse Thu, 22 Oct 2020 00:36:41 -0700

Ulrich,

Jan Friesse <jfrie...@redhat.com> wrote on 20.10.2020 at 18:05 in message

<9e9edd13-847c-a81f-9b28-0ecf8f17f...@redhat.com>:

I forgot to mention one very important change (in the text; the release
notes on the GitHub release are already fixed):

...


- Default token timeout was changed from 1 second to 3 seconds. Default


Hi!

The same stupid question as always: How is that value determined, assuming that 
in a LAN the per-hop delay should be less than 1ms these days and the number of 
nodes typically is much less than 10? Is there a safety factor of 1000%, or 
what?
Or is this just black magic, and the value was determined in a sleepless 
full-moon night by throwing dice?


It's somewhere in the middle actually.

The reason for increasing the value is the number of GSS cases where
increasing the token timeout helped reduce the number of "unexpected"
fencing events.

The proposal was to increase the value to 5 secs, but that would make
upgrading hard, because nodes with the old version would detect token
loss (the default config is to resend the token 4 times, so
5s/4 = 1.25 secs, which is above the old 1 sec timeout).


There is no such problem with 3 secs.
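
The arithmetic above can be sketched in a few lines of Python. This is
only a simplified model of the reasoning in this mail (resend gap =
token timeout / 4); the function names are made up for illustration and
corosync's real retransmit computation may differ in detail.

```python
# Simplified model of the upgrade reasoning, not corosync's actual code.
OLD_TOKEN_MS = 1000          # default token timeout before 3.1.0
RETRANSMITS_BEFORE_LOSS = 4  # default number of token resends


def resend_interval_ms(token_ms, retransmits=RETRANSMITS_BEFORE_LOSS):
    """Approximate gap between token resends for a given token timeout."""
    return token_ms / retransmits


def safe_for_mixed_cluster(new_token_ms, old_token_ms=OLD_TOKEN_MS):
    """True if the new node's resend gap stays below the old nodes' timeout."""
    return resend_interval_ms(new_token_ms) < old_token_ms


for candidate in (3000, 5000):
    print(candidate, resend_interval_ms(candidate),
          safe_for_mixed_cluster(candidate))
# 3000 750.0 True
# 5000 1250.0 False
```

With 3000 ms the gap is 750 ms, so nodes still running the old 1000 ms
default see the token in time; with 5000 ms they would not.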

The main problem is that choosing timeouts is not an exact science. We
have to choose a timeout which is high enough to give nodes enough time
in case of spikes (various ones - cpu/blocked IO/network/...) but also
low enough to react as quickly as possible. 1 sec was working well most
of the time, but then something bad happened and a node was fenced
"without a reason". So to conclude, yes, it is kind of black magic.


Regards,
  Honza


Regards,
Ulrich

token timeout of 1000 ms was often changed by users because of other
workloads on the machine which may make corosync respond a bit later
than needed, resulting in token loss. 3000 ms was chosen as a compromise
between increasing the token timeout and allowing live cluster upgrade
(other nodes should receive the token from a node with the new default
on time). It doesn't affect token_coefficient, so the final token
timeout still depends on the number of configured nodes (just the base
is higher). This change slows down failover a bit, so for clusters where
failover times are important, please change the token timeout in the
configuration file corosync.conf as follows:

totem {
    version: 2
    token: 1000
    ...
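
To see what "just base is higher" means in practice, here is a rough
Python sketch of the effective token timeout. It assumes the formula
from corosync.conf(5) as I understand it (token + (nodes - 2) *
token_coefficient, used only with a nodelist of at least 3 nodes) and
the default token_coefficient of 650 ms; the function name is made up
for illustration.

```python
def effective_token_ms(token_ms, nodes, token_coefficient_ms=650):
    """Estimate the runtime token timeout for a given nodelist size.

    Assumption: the coefficient only applies from 3 configured nodes up.
    """
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


for base in (1000, 3000):
    for nodes in (2, 3, 5):
        print(base, nodes, effective_token_ms(base, nodes))
# 1000 2 1000
# 1000 3 1650
# 1000 5 2950
# 3000 2 3000
# 3000 3 3650
# 3000 5 4950
```

So setting token: 1000 restores the old base, while the per-node part
stays the same as before.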



_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

