Adam Spiers wrote:
Hi all,

Jan Friesse <jfrie...@redhat.com> wrote:
There is really no help. It's best to make sure corosync is scheduled
regularly.
I may sound silly, but how can I do it?

It's actually very hard to say. Pauses like 30 seconds are really unusual
and shouldn't happen (especially with RT scheduling). They usually
happen on a VM where the host is overcommitted.
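The pause corosync complains about is simply the gap between when the process asked to wake up and when the kernel actually scheduled it. A minimal sketch of that detection idea (not corosync code; the function name and thresholds are illustrative):

```python
import time

def detect_pauses(interval_s=0.1, threshold_s=0.5, iterations=5):
    """Sleep repeatedly and report any wakeup gap far longer than requested.

    This mimics the idea behind corosync's warning
    "Corosync main process was not scheduled for X ms":
    compare monotonic time actually elapsed against the requested interval.
    """
    pauses = []
    last = time.monotonic()
    for _ in range(iterations):
        time.sleep(interval_s)
        now = time.monotonic()
        gap = now - last
        if gap > threshold_s:
            # A real scheduler stall (overcommitted host, swapping, ...)
            # shows up here as a gap much larger than interval_s.
            pauses.append(gap)
        last = now
    return pauses

# On a lightly loaded machine this should report no pauses.
print(detect_pauses())
```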

It's funny you are discussing this during the same period in which my
team is seeing this happen fairly regularly within VMs on an
overcommitted host.  In other words, I can confirm Jan's statement
above is true.

Yep, sadly virtualization affects scheduling a lot. For cluster nodes it really makes sense to pin every virtual CPU core to a dedicated physical CPU core.
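With libvirt, that pinning can be expressed in the guest's domain XML. A hedged sketch (host core numbers 2 and 3 are placeholders for cores you have actually reserved for the guest):

```xml
<!-- Illustrative <cputune> fragment: pin each vCPU to its own host
     core so the hypervisor cannot starve the guest running corosync. -->
<vcpu placement='static'>2</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
</cputune>
```

The same pinning can be applied to a running guest with `virsh vcpupin <domain> <vcpu> <host-cpu>`.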


Like Konstiantyn, we have also sometimes seen no fencing occur as a
result of these pauses, e.g.

Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [MAIN  ] Corosync main process was not scheduled for 7343.1909 ms (threshold is 4000.0000 ms). Consider token timeout increase.
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [TOTEM ] A processor failed, forming new configuration.
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] CLM CONFIGURATION CHANGE
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] New Configuration:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.82)
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.84)
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Left:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Joined:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 32: memb=2, new=0, lost=0
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: memb: d52-54-77-77-77-01 1084752466
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: memb: d52-54-77-77-77-02 1084752468
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] CLM CONFIGURATION CHANGE
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] New Configuration:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.82)
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.84)
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Left:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Joined:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 32: memb=2, new=0, lost=0
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: MEMB: d52-54-77-77-77-01 1084752466
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: MEMB: d52-54-77-77-77-02 1084752468
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.2.82) ; members(old:2 left:0)
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [MAIN  ] Completed service synchronization, ready to provide service.

I don't understand why it claims a processor failed, forming a new
configuration, when the configuration appears no different from
before: no members joined or left.  Can anyone explain this?

Corosync uses a token, very similar to the old token ring; in corosync it is used for congestion control (only the node holding the token can send messages) and for ordering of messages. The token timeout is the maximum time to wait for the token: if the token doesn't arrive within it, the token is considered lost. This is how corosync detects problems in the network or, more generally, failed nodes. If corosync is not scheduled for a long time, the situation (from the affected node's point of view) is the same as a lost token. Corosync can tell that it was not scheduled for longer than the token timeout, but that doesn't change the steps which follow. The node cannot be sure that the other nodes didn't form a different membership (one without the affected node), so it has to go through the gather state (contact all nodes, collect their view of the world, decide) even if, for the given node, nothing really changed (the other nodes may see it differently).
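The log's hint "Consider token timeout increase" refers to the `token` setting in the `totem` section of corosync.conf. A sketch of raising it (the value is illustrative; a longer timeout tolerates longer scheduling pauses, but also means real node failures are detected more slowly):

```
totem {
    version: 2
    # Maximum time in ms to wait for the token before declaring it lost.
    # Raising this above the observed pause length (7343 ms in the log
    # above) would avoid the spurious membership change.
    token: 10000
}
```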

Hope it helps a bit.

Regards,
  Honza


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



