Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Jan Friesse Tue, 29 Aug 2017 08:12:03 -0700

Ferenc,

Jan Friesse <jfrie...@redhat.com> writes:

wf...@niif.hu writes:

In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
(in August; in May, it happened 0-2 times a day only, it's slowly
ramping up):

vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.


^^^ This is the main problem you have to solve. It usually means that
the machine is overloaded. It happens quite often when corosync is
running inside a VM and the host machine is unable to schedule the VM
regularly.


Hi Honza,

Corosync isn't running in a VM here, these nodes are 2x8 core servers
hosting VMs themselves as Pacemaker resources.  (Incidentally, some of
these VMs run Corosync to form a test cluster, but that should be
irrelevant now.)  And they aren't overloaded in any apparent way: Munin
reports 2900% CPU idle (out of 32 hyperthreads).  There's no swap, but
the corosync process is locked into memory anyway.  It's also running as
SCHED_RR prio 99, competing only with multipathd and the SCHED_FIFO prio
99 kernel threads (migration/* and watchdog/*) under Linux 4.9.  I'll
try to take a closer look at the scheduling of these.  Can you recommend
some indicators to check out?

No real hints. But one question: are you 100% sure memory is locked?
We had a problem where mlockall was called in the wrong place, so
corosync was actually not locked, and that was causing similar issues.


This behavior is fixed by
https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
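
One way to check the memory-lock question on a running node is to read the
VmLck field of /proc/<pid>/status, which Linux reports in kB. The following
is a minimal sketch, not part of Corosync: the helper names vm_kib and
looks_locked are made up for illustration, and "VmLck greater than zero" is
only a rough indicator, not proof that mlockall covered everything.

```python
import re


def vm_kib(status_text, field):
    """Return the kB value of a Vm* field from /proc/<pid>/status text, or None."""
    pattern = rf"^{re.escape(field)}:\s+(\d+)\s+kB"
    match = re.search(pattern, status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def looks_locked(status_text):
    """Rough indicator only: True if the process reports any locked memory."""
    locked = vm_kib(status_text, "VmLck")
    return locked is not None and locked > 0
```

To use it, read /proc/<pid>/status for the corosync PID (3805 on vhbl07 in
the log above) and pass the text to looks_locked; comparing VmLck against
VmRSS from the same file gives a feel for how much of the process is pinned.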


Are scheduling delays expected to generate TOTEM membership "changes"
without any leaving and joining nodes?


Yes, they are.

As a start you can try what the message says: consider a token timeout
increase. Currently you have 3 seconds; in theory 6 seconds should be
enough.
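
The figures in the warning can be read off mechanically. The sketch below
parses the "not scheduled" message quoted earlier and compares the pause
against a candidate token timeout. Two things are assumptions, not
statements from this thread: the 0.8 ratio between warning threshold and
token timeout is inferred from the numbers here (2400 ms threshold, 3 second
token), and covers_pause is a deliberately crude check that ignores every
other totem timer. The function names are hypothetical.

```python
import re

# Matches the pause and threshold in the Corosync "not scheduled" warning.
PAUSE_RE = re.compile(
    r"not scheduled for\s+([\d.]+)\s*ms\s+\(threshold is\s+([\d.]+)\s*ms\)"
)


def parse_pause(line):
    """Return (pause_ms, threshold_ms) from the warning, or None if absent."""
    match = PAUSE_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def implied_token_ms(threshold_ms, ratio=0.8):
    """Token timeout implied by the threshold (ratio is an assumption)."""
    return threshold_ms / ratio


def covers_pause(token_ms, pause_ms):
    """Crude check: is the token timeout longer than the observed pause?"""
    return token_ms > pause_ms
```

With the logged values, the 4317 ms pause exceeds the current 3000 ms token
timeout but not the suggested 6000 ms. The value itself is set as token (in
milliseconds) in the totem section of corosync.conf.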


OK, thanks for the tip.  Can I do this on-line, without shutting down
Corosync?

The Corosync way is to just edit/copy corosync.conf on all nodes and
call corosync-cfgtool -R on one of the nodes (crmsh/pcs may have a
better way).
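
The procedure above can be sketched as a dry-run planner that only prints
the commands instead of running them. Only corosync-cfgtool -R comes from
this thread; the use of scp and ssh, the path /etc/corosync/corosync.conf,
and the function name reload_plan are assumptions for illustration. The node
names follow the vhbl03-08 range mentioned at the top.

```python
import shlex


def reload_plan(nodes, conf="/etc/corosync/corosync.conf"):
    """Return the commands for a config push plus reload, without running them."""
    if not nodes:
        raise ValueError("need at least one node")
    # Copy the edited file to every node first (scp is an assumed transport).
    commands = [["scp", conf, f"{node}:{conf}"] for node in nodes]
    # Then ask one node to trigger a cluster-wide reload.
    commands.append(["ssh", nodes[0], "corosync-cfgtool", "-R"])
    return commands


nodes = [f"vhbl{n:02d}" for n in range(3, 9)]
for command in reload_plan(nodes):
    print(shlex.join(command))
```

Printing the plan first makes it easy to confirm that every node receives
the same file before the single reload is issued.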


Regards,
  Honza


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
