Hi! I don't know the answer, but I wonder what would happen if corosync ran at normal scheduling priority. My suspicion is that something else is wrong, and using the highest real-time priority could be the wrong fix for that problem ;-)
Personally, I think a process that does disk I/O and waits for network input cannot be the highest-priority real-time job. (Such a candidate would be a process that has its memory locked and does shared-memory communication without any I/O.) Sorry for this off-topic thought.

Regards,
Ulrich

>>> Ferenc Wágner <wf...@niif.hu> wrote on 01.09.2017 at 00:40 in message <87inh38ip3....@lant.ki.iif.hu>:
> Jan Friesse <jfrie...@redhat.com> writes:
>
>> wf...@niif.hu writes:
>>
>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>> ramping up):
>>>
>>> vhbl08 corosync[3687]: [TOTEM ] A processor failed, forming new configuration.
>>> vhbl03 corosync[3890]: [TOTEM ] A processor failed, forming new configuration.
>>> vhbl07 corosync[3805]: [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>>
>> ^^^ This is main problem you have to solve. It usually means that
>> machine is too overloaded. [...]
>
> Before I start tracing the scheduler, I'd like to ask something: what
> wakes up the Corosync main process periodically? The token making a
> full circle? (Please forgive my simplistic understanding of the TOTEM
> protocol.) That would explain the recommendation in the log message,
> but does not fit well with the overload assumption: totally idle nodes
> could just as easily produce such warnings if there are no other regular
> wakeup sources. (I'm looking at timer_function_scheduler_timeout but I
> know too little of libqb to decide.)
>
>> As a start you can try what message say = Consider token timeout
>> increase. Currently you have 3 seconds, in theory 6 second should be
>> enough.
>
> It was probably high time I realized that token timeout is scaled
> automatically when one has a nodelist.
> When you say Corosync should
> work OK with default settings up to 16 nodes, you assume this scaling is
> in effect, don't you? On the other hand, I've got no nodelist in the
> config, but token = 3000, which is less than the default 1000+4*650 with
> six nodes, and this will get worse as the cluster grows.
>
> Comments on the above ramblings welcome!
>
> I'm grateful for all the valuable input poured into this thread by all
> parties: it's proven really educative in quite unexpected ways beyond
> what I was able to ask in the beginning.
> --
> Thanks,
> Feri
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
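For reference, the automatic scaling Feri mentions can be sketched as follows, assuming the formula from the corosync.conf documentation: with a nodelist present, the effective token timeout is token + (nodes - 2) * token_coefficient, with defaults of 1000 ms and 650 ms (hence the "1000+4*650" for six nodes above):

```python
# Sketch of corosync's token-timeout scaling when a nodelist is configured.
# Assumed formula: effective = token + (nodes - 2) * token_coefficient,
# with default token = 1000 ms and token_coefficient = 650 ms.

def effective_token_timeout(nodes, token=1000, token_coefficient=650):
    """Return the scaled token timeout in milliseconds."""
    if nodes <= 2:
        return token          # no scaling below three nodes
    return token + (nodes - 2) * token_coefficient

for n in (2, 6, 16):
    print(f"{n} nodes -> {effective_token_timeout(n)} ms")
```

For six nodes this yields 3600 ms, i.e. already more than the fixed token = 3000 in the configuration above, and the gap widens as the cluster grows.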