Hi,
I'm having what I think is a timeout issue in my cluster.
I have a two-node cluster using qdisk. Every time the node that holds the qdisk
master role goes down (because of a failure, or even when qdiskd is stopped
manually), packages on the surviving node are stopped for lack of quorum,
because qdiskd becomes unresponsive until the second node becomes master and
starts working properly. Once qdiskd is working again (usually after 5-6
seconds), the packages are started again.
I've read the cluster manual section on the "CMAN membership timeout value" and
I think it applies here. I'm using RHEL 5.3, and I understood this parameter to
be the totem token, which I have set much longer than needed:
<cluster alias="CLUSTER_ENG" config_version="75" name="CLUSTER_ENG">
<totem token="50000"/>
...
<quorumd device="/dev/mapper/mpathquorump1" interval="3"
status_file="/tmp/qdisk" tko="3" votes="5" log_level="7" log_facility="local4"/>
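
For reference, the arithmetic behind these two settings can be sketched as
follows. This is only a rough check, assuming the usual interpretation that the
qdisk timeout is interval * tko (in seconds) and that the totem token is
expressed in milliseconds; the variable names are just for illustration.

```python
# Values copied from the cluster.conf fragment above.
token_ms = 50000       # <totem token="50000"/>, assumed to be milliseconds
qdisk_interval_s = 3   # quorumd interval, assumed to be seconds per cycle
qdisk_tko = 3          # quorumd tko: missed cycles before a node is declared down

# Assumption: qdisk timeout = interval * tko.
qdisk_timeout_s = qdisk_interval_s * qdisk_tko
token_s = token_ms / 1000

print(f"qdisk timeout: {qdisk_timeout_s} s")
print(f"totem token:   {token_s:.0f} s")
print(f"token is more than double the qdisk timeout: {token_s > 2 * qdisk_timeout_s}")
```
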
The totem token is much more than double the qdisk timeout, so I would expect
it to be enough, but every time qdisk dies on the master node I get the same
result, with services restarted on the surviving node:
Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (2/3)
Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (3/3)
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (4/3)
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 DOWN
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Making bid for master
Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: <info> Executing /etc/init.d/watchdog status
Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (5/3)
Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (6/3)
Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <info> Assuming master role
Message from sysl...@rmamseslab07 at Jun 15 16:11:53 ...
clurgmgrd[18510]: <emerg> #1: Quorum Dissolved
Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with quorum device
Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking activity
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Membership Change Event
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <emerg> #1: Quorum Dissolved
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:Cluster_test_2
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:wdtcscript-rmamseslab05-ic
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:wdtcscript-rmamseslab07-ic
Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:Logical volume 1
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (7/3)
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <notice> Writing eviction notice for node 1
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Telling CMAN to kill the node
Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained, resuming activity
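
To put numbers on the window in that log, here is a small sketch that parses
the syslog timestamps and measures how long quorum was lost. The four lines are
copied from the excerpt above (with wrapped lines rejoined); the helper names
and the placeholder year are my own assumptions, since syslog timestamps carry
no year.

```python
from datetime import datetime

# Lines copied from the log excerpt above, wrapped lines rejoined.
log_lines = [
    "Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 DOWN",
    "Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <info> Assuming master role",
    "Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking activity",
    "Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained, resuming activity",
]

def timestamp(line):
    # The first 15 characters are "Mon DD HH:MM:SS"; the year 2000 is an
    # arbitrary placeholder because syslog does not record one.
    return datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S")

def first_match(lines, needle):
    # Timestamp of the first line containing the given text.
    return next(timestamp(line) for line in lines if needle in line)

down = first_match(log_lines, "Node 1 DOWN")
master = first_match(log_lines, "Assuming master role")
lost = first_match(log_lines, "quorum lost")
regained = first_match(log_lines, "quorum regained")

election_s = (master - down).total_seconds()
outage_s = (regained - lost).total_seconds()

print(f"Node 1 DOWN to new master: {election_s:.0f} s")
print(f"Quorum lost to regained:   {outage_s:.0f} s")
```

In this excerpt the quorum outage begins at the same second the surviving node
assumes the master role, and lasts about as long as the 5-6 seconds mentioned
earlier.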
I've just opened a support case, but does anyone have any ideas?
Regards,
Alfredo Moralejo
Business Platforms Engineering - OS Servers - UNIX Senior Specialist
F. Hoffmann-La Roche Ltd.
Global Informatics Group Infrastructure
Josefa Valcárcel, 40
28027 Madrid SPAIN
Phone: +34 91 305 97 87