Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Ferenc Wágner Sat, 09 Sep 2017 23:31:46 -0700

wf...@niif.hu (Ferenc Wágner) writes:

> Jan Friesse <jfrie...@redhat.com> writes:
>
>> wf...@niif.hu writes:
>>
>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>> (in August; in May it happened only 0-2 times a day, so it is slowly
>>> ramping up):
>>>
>>> vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
>>> vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
>>> vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>>
>> ^^^ This is the main problem you have to solve. It usually means that
>> the machine is overloaded. It happens quite often when corosync runs
>> inside a VM and the host machine is unable to schedule the VM
>> regularly.
>
> After some extensive tracing, I think the problem lies elsewhere: my
> IPMI watchdog device is slow beyond imagination.


Confirmed: setting watchdog_device: off cluster-wide got rid of the
above warnings.
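
For reference, a minimal sketch of that setting, assuming it lives in
the resources section of corosync.conf as described in
corosync.conf(5), and that Corosync was built with watchdog support.
Please check the man page of your own version; the rest of the file is
left out here.

```
# /etc/corosync/corosync.conf (fragment, assumed layout)
resources {
    # "off" tells Corosync not to use any watchdog device
    watchdog_device: off
}
```

The change has to be made on every node, and Corosync most likely
needs a restart to pick it up.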

> Its ioctl operations can take seconds, starving all other functions.
> At least, it seems to block the main thread of Corosync.  Is this a
> plausible scenario?  Corosync has two threads, what are their roles?
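
To illustrate the suspected mechanism (this is not Corosync's actual
code, only a sketch under my own assumptions): in a single-threaded
loop, one slow blocking call delays every later wake-up, and a check of
the gap between wake-ups against a threshold reports it, much like the
"not scheduled for ... ms" warning does. The names detect_pauses and
slow_on_third_tick are hypothetical, and the sleep merely stands in for
a slow watchdog ioctl.

```python
import time


def detect_pauses(ticks, period_s, threshold_s, work,
                  clock=time.monotonic, sleep=time.sleep):
    """Run a single-threaded loop and return the gaps (in seconds)
    between consecutive wake-ups that exceeded threshold_s."""
    pauses = []
    last = clock()
    for i in range(ticks):
        work(i)          # a slow call here holds up the whole loop
        sleep(period_s)  # normal wait until the next tick
        now = clock()
        gap = now - last
        if gap > threshold_s:
            pauses.append(gap)
        last = now
    return pauses


if __name__ == "__main__":
    # Hypothetical stand-in for a slow watchdog ioctl: tick 3 blocks for 50 ms.
    def slow_on_third_tick(i):
        if i == 3:
            time.sleep(0.05)

    for gap in detect_pauses(6, 0.01, 0.03, slow_on_third_tick):
        print(f"loop was not scheduled for {gap * 1000:.1f} ms")
```

The clock and sleep parameters are injectable only so that the check
can be exercised without real waiting.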
-- 
Regards,
Feri

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
