On 08/30/2017 08:54 AM, Jan Friesse wrote:
> Ferenc,
>
>> Jan Friesse <jfrie...@redhat.com> writes:
>>
>>> wf...@niif.hu writes:
>>>
>>>> Jan Friesse <jfrie...@redhat.com> writes:
>>>>
>>>>> wf...@niif.hu writes:
>>>>>
>>>>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times
>>>>>> a day (in August; in May it happened only 0-2 times a day, so
>>>>>> it's slowly ramping up):
>>>>>>
>>>>>> vhbl08 corosync[3687]: [TOTEM ] A processor failed, forming new
>>>>>> configuration.
>>>>>> vhbl03 corosync[3890]: [TOTEM ] A processor failed, forming new
>>>>>> configuration.
>>>>>> vhbl07 corosync[3805]: [MAIN ] Corosync main process was not
>>>>>> scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider
>>>>>> token timeout increase.
>>>>>
>>>>> ^^^ This is the main problem you have to solve. It usually means
>>>>> that the machine is too overloaded. It happens quite often when
>>>>> corosync is running inside a VM and the host machine is unable to
>>>>> schedule the VM regularly.
>>>>
>>>> Corosync isn't running in a VM here; these nodes are 2x8-core
>>>> servers hosting VMs themselves as Pacemaker resources.
>>>> (Incidentally, some of these VMs run Corosync to form a test
>>>> cluster, but that should be irrelevant now.) And they aren't
>>>> overloaded in any apparent way: Munin reports 2900% CPU idle (out
>>>> of 32 hyperthreads). There's no swap, but the corosync process is
>>>> locked into memory anyway. It's also running as SCHED_RR prio 99,
>>>> competing only with multipathd and the SCHED_FIFO prio 99 kernel
>>>> threads (migration/* and watchdog/*) under Linux 4.9. I'll try to
>>>> take a closer look at the scheduling of these. Can you recommend
>>>> some indicators to check out?
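A few indicators you could check from the shell -- just a sketch, not
an exhaustive list; 3805 is the corosync PID taken from the ps output
further down in this thread, and pidstat comes from the sysstat
package:

   # confirm the real-time policy/priority corosync actually runs with
   chrt -p 3805

   # confirm the memory really is locked: VmLck should be close to the
   # RSS, not 0 kB
   grep VmLck /proc/3805/status

   # watch context switches once per second; a high nvcswch/s
   # (involuntary switches) on an otherwise idle box points at
   # scheduling pressure
   pidstat -w -p 3805 1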
Just seen that you are hosting VMs, which might make you use KSM ...
Don't fully remember at the moment, but I have some memory of issues
with KSM and page-locking. IIRC it was some bug in kernel memory
management that should have been fixed a long time ago, but ...

Regards,
Klaus

>>> No real hints. But one question: are you 100% sure memory is
>>> locked? Because we had a problem where mlockall was called in the
>>> wrong place, so corosync was actually not locked, and that was
>>> causing similar issues.
>>>
>>> This behavior is fixed by
>>> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
>>
>> I based this assertion on the L flag in the ps STAT column. The
>> above commit should not affect me because I'm running corosync with
>> the -f option:
>
> Oh, OK. If you are running with -f, then the bug above doesn't
> affect you.
>
>> $ ps l 3805
>> F UID  PID PPID  PRI NI    VSZ    RSS WCHAN STAT TTY   TIME COMMAND
>> 4   0 3805    1 -100  - 247464 141016 -     SLsl ?   251:10 /usr/sbin/corosync -f
>>
>> By the way, are the above VSZ and RSS numbers reasonable?
>
> Yep, perfectly reasonable.
>
> Regards,
> Honza
>
>> One more thing: these servers run without any swap.
>>
>>>>> As a start you can try what the message says: consider a token
>>>>> timeout increase. Currently you have 3 seconds; in theory 6
>>>>> seconds should be enough.
>>>>
>>>> OK, thanks for the tip. Can I do this on-line, without shutting
>>>> down Corosync?
>>>
>>> The Corosync way is to just edit/copy corosync.conf on all nodes
>>> and call corosync-cfgtool -R on one of the nodes (crmsh/pcs may
>>> have a better way).
>>
>> Great, that's what I wanted to know: whether -R is expected to make
>> this change effective.
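To make the token change concrete -- a minimal sketch of the totem
stanza, assuming only the 3 s -> 6 s increase suggested above; every
other setting in your existing corosync.conf stays as it is:

   # /etc/corosync/corosync.conf, on every node
   totem {
           version: 2
           token: 6000    # milliseconds; up from the current 3000
   }

   # then, on one node, ask the whole cluster to re-read the file:
   corosync-cfgtool -R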