Ulrich,

Jan Friesse <jfrie...@redhat.com> wrote on 20.10.2020 at 18:05 in message
<9e9edd13-847c-a81f-9b28-0ecf8f17f...@redhat.com>:
I forgot to mention one very important change (in the text only; the release
notes in the GitHub release are already fixed):

...

- Default token timeout was changed from 1 second to 3 seconds. Default

Hi!

The same stupid question as always: how is that value determined, given that 
in a LAN the per-hop delay should be less than 1 ms these days and the number of 
nodes is typically much less than 10? Is there a safety factor of 1000%, or 
what?
Or is this just black magic, and the value was determined in a sleepless 
full-moon night by throwing dice?

It's somewhere in the middle actually.

The reason for increasing the value is the number of GSS cases where increasing the token timeout helped reduce the number of "unexpected" fencing events.

The proposal was to increase the value to 5 s, but that would make upgrading hard, because nodes running the old version would detect token loss (the default config resends the token 4 times, so 5 s / 4 = 1.25 s, which is longer than the old 1 s token timeout).

There is no such problem with 3 secs.
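
To make the arithmetic above concrete, here is a rough sketch in Python. It uses only the simplified rule from this mail (token timeout divided by the 4 resends); the function names are made up for illustration, and the exact retransmit calculation inside corosync may differ slightly.

```python
# Sketch of the upgrade-compatibility arithmetic from this mail.
# Assumption: resend interval ~= token timeout / number of resends
# (the simplified rule quoted above, not corosync's exact formula).

def resend_interval_ms(token_ms: float, resends: int = 4) -> float:
    """Approximate time between token resends on a node."""
    return token_ms / resends


def upgrade_safe(new_token_ms: float, old_token_ms: float = 1000,
                 resends: int = 4) -> bool:
    """True if a node with the new default resends the token before a
    node still using the old default would declare the token lost."""
    return resend_interval_ms(new_token_ms, resends) < old_token_ms


for candidate_ms in (3000, 5000):
    print(candidate_ms, resend_interval_ms(candidate_ms),
          upgrade_safe(candidate_ms))
```

With these assumptions, 3000 ms gives a 750 ms resend interval (inside the old 1000 ms timeout), while 5000 ms gives 1250 ms (outside it).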

The main problem is that choosing timeouts is not an exact science. We have to choose a timeout that is high enough to give nodes enough time in case of spikes (of various kinds: CPU, blocked I/O, network, ...) but also low enough to react as quickly as possible. 1 s worked well most of the time, but then something bad happened and a node was fenced "for no reason". So to conclude, yes, it is kind of black magic.

Regards,
  Honza


Regards,
Ulrich

token timeout of 1000 ms was often changed by users because other
workloads on the machine may make corosync respond a bit later than
needed, resulting in token loss. 3000 ms was chosen as a compromise
between increasing the token timeout and still allowing a live cluster
upgrade (other nodes should receive the token from a node with the new
default in time). It doesn't affect token_coefficient, so the final token
timeout still depends on the number of configured nodes (just the base is
higher). This change slows down failover a bit, so for clusters where
failover times are important, please change the token timeout in the
configuration file corosync.conf as follows:

totem {
    version: 2
    token: 1000
    ...
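
As a rough illustration of how the base value and token_coefficient combine, the sketch below assumes the formula described in corosync.conf(5) as I understand it: with a nodelist of at least 3 nodes, the effective timeout is token + (number_of_nodes - 2) * token_coefficient, with token_coefficient defaulting to 650 ms. The function name is hypothetical; please check the man page of your installed version before relying on the numbers.

```python
# Sketch of the effective token timeout, assuming the formula from
# corosync.conf(5): token + (nodes - 2) * token_coefficient, applied
# only when the nodelist contains at least 3 nodes.
# The 650 ms default for token_coefficient is an assumption taken
# from that man page, not from this mail.

def effective_token_ms(token_ms: int = 3000, nodes: int = 2,
                       token_coefficient_ms: int = 650) -> int:
    """Effective token timeout in milliseconds for a given node count."""
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


for n in (2, 3, 5, 8):
    # Compare the new default base (3000 ms) with the old one (1000 ms).
    print(n, effective_token_ms(3000, n), effective_token_ms(1000, n))
```

Under these assumptions, setting token: 1000 on a 5-node cluster still yields an effective timeout of 2950 ms, versus 4950 ms with the new default.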


_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

