Adam Spiers wrote:
Hi all,

Jan Friesse <jfrie...@redhat.com> wrote:
There is really no help. It's best to make sure corosync is scheduled
regularly.
I may sound silly, but how can I do it?

It's actually very hard to say. Pauses like 30 seconds are really unusual
and shouldn't happen (especially with RT scheduling). They usually
happen on a VM where the host is overcommitted.
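The pause corosync complains about is simply the gap between when the process asked to wake up and when the kernel actually scheduled it. A minimal sketch of that detection idea (not corosync code; the function name and thresholds are illustrative):

```python
import time

def detect_pauses(interval_s=0.1, threshold_s=0.5, iterations=5):
    """Sleep repeatedly and report any wakeup gap far longer than requested.

    This mimics the idea behind corosync's warning
    "Corosync main process was not scheduled for X ms":
    compare monotonic time actually elapsed against the requested interval.
    """
    pauses = []
    last = time.monotonic()
    for _ in range(iterations):
        time.sleep(interval_s)
        now = time.monotonic()
        gap = now - last
        if gap > threshold_s:
            # A real scheduler stall (overcommitted host, swapping, ...)
            # shows up here as a gap much larger than interval_s.
            pauses.append(gap)
        last = now
    return pauses

# On a lightly loaded machine this should report no pauses.
print(detect_pauses())
```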

It's funny you are discussing this during the same period in which my
team is seeing this happen fairly regularly within VMs on an
overcommitted host.  In other words, I can confirm Jan's statement
above is true.

Yep, sadly virtualization affects scheduling a lot. For cluster nodes it really makes sense to pin every virtual CPU core to a dedicated physical CPU core.
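With libvirt, that pinning can be expressed in the guest's domain XML. A hedged sketch (host core numbers 2 and 3 are placeholders for cores you have actually reserved for the guest):

```xml
<!-- Illustrative <cputune> fragment: pin each vCPU to its own host
     core so the hypervisor cannot starve the guest running corosync. -->
<vcpu placement='static'>2</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
</cputune>
```

The same pinning can be applied to a running guest with `virsh vcpupin <domain> <vcpu> <host-cpu>`.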


Like Konstiantyn, we have also sometimes seen no fencing occur as a
result of these pauses, e.g.

Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [MAIN  ] Corosync main process was not scheduled for 7343.1909 ms (threshold is 4000.0000 ms). Consider token timeout increase.
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [TOTEM ] A processor failed, forming new configuration.
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] CLM CONFIGURATION CHANGE
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] New Configuration:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.82)
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.84)
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Left:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Joined:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 32: memb=2, new=0, lost=0
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: memb: d52-54-77-77-77-01 1084752466
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: memb: d52-54-77-77-77-02 1084752468
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] CLM CONFIGURATION CHANGE
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] New Configuration:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.82)
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.84)
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Left:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Joined:
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 32: memb=2, new=0, lost=0
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: MEMB: d52-54-77-77-77-01 1084752466
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: MEMB: d52-54-77-77-77-02 1084752468
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.2.82) ; members(old:2 left:0)
Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [MAIN  ] Completed service synchronization, ready to provide service.

I don't understand why it claims a processor failed, forming a new
configuration, when the configuration appears no different from
before: no members joined or left.  Can anyone explain this?

Corosync uses a token, very similar to the old token ring; in corosync it is used for congestion control (only the node holding the token can send messages) and for ordering of messages. The token timeout is the maximum time to wait for the token: if the token doesn't arrive within it, the token is considered lost. This is how corosync detects problems in the network or, more generally, failed nodes. If corosync is not scheduled for a long time, the situation (from the affected node's point of view) is the same as a lost token. Corosync can tell that it was not scheduled for longer than the token timeout, but that doesn't change the steps which follow. The node cannot be sure that the other nodes didn't form a different membership (one without the affected node), so it has to go through the gather state (contact all nodes, collect their view of the world, decide) even if, for the given node, nothing really changed (the other nodes may see it differently).
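The log's hint "Consider token timeout increase" refers to the `token` setting in the `totem` section of corosync.conf. A sketch of raising it (the value is illustrative; a longer timeout tolerates longer scheduling pauses, but also means real node failures are detected more slowly):

```
totem {
    version: 2
    # Maximum time in ms to wait for the token before declaring it lost.
    # Raising this above the observed pause length (7343 ms in the log
    # above) would avoid the spurious membership change.
    token: 10000
}
```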

Hope it helps a bit.

Regards,
  Honza


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



