Ferenc,
Hi,
In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
(in August; in May it happened only 0-2 times a day, so it's slowly
ramping up):
vhbl08 corosync[3687]: [TOTEM ] A processor failed, forming new configuration.
vhbl03 corosync[3890]: [TOTEM ] A processor failed, forming new configuration.
vhbl07 corosync[3805]: [MAIN ] Corosync main process was not scheduled for
4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
^^^ This is the main problem you have to solve. It usually means that
the machine is too overloaded. It happens quite often when corosync is
running inside a VM and the host machine is unable to schedule the VM
regularly.
As a start, you can try what the message says: increase the token
timeout. Currently you have 3 seconds; in theory, 6 seconds should be enough.
Honza
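For reference, the token timeout lives in the totem section of corosync.conf; a minimal sketch of that suggestion (only the token line is the actual change, the rest of the section is assumed boilerplate). The logged 2400 ms threshold is consistent with the 3-second figure: it appears to be 80% of the current 3000 ms token timeout.

```
totem {
    version: 2
    # Token timeout in milliseconds; raised from the current 3000 ms
    # ("3 seconds") to the suggested 6 seconds.
    token: 6000
}
```

After changing it on all nodes, `corosync-cfgtool -R` should reload the configuration cluster-wide on 2.4.x, and `corosync-cmapctl runtime.config.totem.token` shows the value actually in effect.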
vhbl07 corosync[3805]: [TOTEM ] A processor failed, forming new configuration.
vhbl04 corosync[3759]: [TOTEM ] A new membership (10.0.6.9:3056) was formed.
Members
vhbl05 corosync[3919]: [TOTEM ] A new membership (10.0.6.9:3056) was formed.
Members
vhbl06 corosync[3759]: [TOTEM ] A new membership (10.0.6.9:3056) was formed.
Members
vhbl07 corosync[3805]: [TOTEM ] A new membership (10.0.6.9:3056) was formed.
Members
vhbl08 corosync[3687]: [TOTEM ] A new membership (10.0.6.9:3056) was formed.
Members
vhbl03 corosync[3890]: [TOTEM ] A new membership (10.0.6.9:3056) was formed.
Members
vhbl07 corosync[3805]: [QUORUM] Members[6]: 167773705 167773706 167773707
167773708 167773709 167773710
vhbl08 corosync[3687]: [QUORUM] Members[6]: 167773705 167773706 167773707
167773708 167773709 167773710
vhbl06 corosync[3759]: [QUORUM] Members[6]: 167773705 167773706 167773707
167773708 167773709 167773710
vhbl07 corosync[3805]: [MAIN ] Completed service synchronization, ready to
provide service.
vhbl04 corosync[3759]: [QUORUM] Members[6]: 167773705 167773706 167773707
167773708 167773709 167773710
vhbl08 corosync[3687]: [MAIN ] Completed service synchronization, ready to
provide service.
vhbl06 corosync[3759]: [MAIN ] Completed service synchronization, ready to
provide service.
vhbl04 corosync[3759]: [MAIN ] Completed service synchronization, ready to
provide service.
vhbl05 corosync[3919]: [QUORUM] Members[6]: 167773705 167773706 167773707
167773708 167773709 167773710
vhbl03 corosync[3890]: [QUORUM] Members[6]: 167773705 167773706 167773707
167773708 167773709 167773710
vhbl05 corosync[3919]: [MAIN ] Completed service synchronization, ready to
provide service.
vhbl03 corosync[3890]: [MAIN ] Completed service synchronization, ready to
provide service.
The cluster is running Corosync 2.4.2 with multicast. Those lines really
end at "Members"; there are no joined or left nodes listed. Pacemaker on
top reacts like:
[9982] vhbl03 pacemakerd: info: pcmk_quorum_notification: Quorum retained
| membership=3056 members=6
[9991] vhbl03 crmd: info: pcmk_quorum_notification: Quorum retained
| membership=3056 members=6
[9986] vhbl03 cib: info: cib_process_request: Completed
cib_modify operation for section nodes: OK (rc=0, origin=vhbl07/crmd/4477,
version=0.1694.12)
[9986] vhbl03 cib: info: cib_process_request: Completed
cib_modify operation for section status: OK (rc=0, origin=vhbl07/crmd/4478,
version=0.1694.12)
[9986] vhbl03 cib: info: cib_process_ping: Reporting our current
digest to vhbl07: 85250f3039d269f96012750f13e232d9 for 0.1694.12
(0x55ef057447d0 0)
on all nodes except for vhbl07, where it says:
[9886] vhbl07 crmd: info: pcmk_quorum_notification: Quorum retained
| membership=3056 members=6
[9877] vhbl07 pacemakerd: info: pcmk_quorum_notification: Quorum retained
| membership=3056 members=6
[9881] vhbl07 cib: info: cib_process_request: Forwarding
cib_modify operation for section nodes to all (origin=local/crmd/
[9881] vhbl07 cib: info: cib_process_request: Forwarding
cib_modify operation for section status to all (origin=local/crmd
[9881] vhbl07 cib: info: cib_process_request: Completed
cib_modify operation for section nodes: OK (rc=0, origin=vhbl07/cr
[9881] vhbl07 cib: info: cib_process_request: Completed
cib_modify operation for section status: OK (rc=0, origin=vhbl07/c
[9881] vhbl07 cib: info: cib_process_ping: Reporting our current
digest to vhbl07: 85250f3039d269f96012750f13e232d9 for 0.1694.
So Pacemaker basically does nothing, and I can't see any adverse effect
on resource management, but DLM seems to have some problem, which may or
may not be related. When the TOTEM error appears, all nodes log this:
vhbl03 dlm_controld[3914]: 2801675 dlm:controld ring 167773705:3056 6 memb
167773705 167773706 167773707 167773708 167773709 167773710
vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
vhbl03 dlm_controld[3914]: 2801675 dlm:ls:clvmd ring 167773705:3056 6 memb
167773705 167773706 167773707 167773708 167773709 167773710
vhbl03 dlm_controld[3914]: 2801675 clvmd wait_messages cg 9 need 1 of 6
vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
vhbl03 dlm_controld[3914]: 2801675 cluster quorum 1 seq 3056 nodes 6
dlm_controld is running with --enable_fencing=0. Pacemaker does its own
fencing if resource management requires it, but DLM is used by cLVM
only, which does not warrant such harsh measures. Right now cLVM is
blocked; I don't know since when, because we seldom do cLVM operations
on this cluster. My immediate aim is to unblock cLVM somehow.
Meanwhile, dlm_tool status reports (similar on all nodes):
cluster nodeid 167773705 quorate 1 ring seq 3088 3088
daemon now 2941405 fence_pid 0
node 167773705 M add 196 rem 0 fail 0 fence 0 at 0 0
node 167773706 M add 5960 rem 5730 fail 0 fence 0 at 0 0
node 167773707 M add 2089 rem 1802 fail 0 fence 0 at 0 0
node 167773708 M add 3646 rem 3413 fail 0 fence 0 at 0 0
node 167773709 M add 2588921 rem 2588920 fail 0 fence 0 at 0 0
node 167773710 M add 196 rem 0 fail 0 fence 0 at 0 0
dlm_tool ls shows "kern_stop":
dlm lockspaces
name clvmd
id 0x4104eefa
flags 0x00000004 kern_stop
change member 5 joined 0 remove 1 failed 1 seq 8,8
members 167773705 167773706 167773707 167773708 167773710
new change member 6 joined 1 remove 0 failed 0 seq 9,9
new status wait messages 1
new members 167773705 167773706 167773707 167773708 167773709 167773710
on all nodes except for vhbl07 (167773709), where it gives
dlm lockspaces
name clvmd
id 0x4104eefa
flags 0x00000000
change member 6 joined 1 remove 0 failed 0 seq 11,11
members 167773705 167773706 167773707 167773708 167773709 167773710
instead.
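For diffing that flags word across many nodes, it can be decoded mechanically. A minimal sketch, assuming only what the output above shows (kern_stop corresponds to bit 0x00000004; any other bits are left as raw unknowns rather than guessed):

```python
# Hypothetical helper for comparing `dlm_tool ls` flags across nodes.
# The kern_stop bit value (0x00000004) is inferred from the output
# above; other set bits are reported as unknowns, not named.
KERN_STOP = 0x00000004

def decode_flags(flags: int) -> list:
    names = []
    if flags & KERN_STOP:
        names.append("kern_stop")
    rest = flags & ~KERN_STOP
    if rest:
        names.append("unknown:0x%08x" % rest)
    return names

print(decode_flags(0x00000004))  # the blocked nodes -> ['kern_stop']
print(decode_flags(0x00000000))  # vhbl07 -> []
```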
Does anybody have an idea what the problem(s) might be? Why is Corosync
deteriorating on this cluster? (It's running with RR PRIO 99.) Could
that have hurt DLM? Is there a way to unblock DLM without rebooting all
nodes? (Actually, rebooting is problematic in itself with blocked cLVM,
but that's tractable.)
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org