> ------------------------------
>
> Message: 2
> Date: Mon, 12 Oct 2015 10:06:01 +0200
> From: Jan Friesse <jfrie...@redhat.com>
> To: Cluster Labs - All topics related to open-source clustering
>     welcomed <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] continous QUORUM messages in a 3-node cluster
> Message-ID: <561b69e9.1060...@redhat.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Illia,
>
>> Hi,
>>
>> We are using a 3-node pacemaker/corosync cluster on CentOs 7.
>> We have several identical setups in our QA/DEV orgs, and a couple of them
>> continuously spew the following messages on all 3 nodes:
>>
>> Oct 8 17:18:20 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553572) was formed. Members
>> Oct 8 17:18:20 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
>> Oct 8 17:18:20 42-hw-rig4-L3-2 corosync[15105]: [MAIN ] Completed service synchronization, ready to provide service.
>> Oct 8 17:18:22 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553576) was formed. Members
>> Oct 8 17:18:22 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
>> Oct 8 17:18:22 42-hw-rig4-L3-2 corosync[15105]: [MAIN ] Completed service synchronization, ready to provide service.
>> Oct 8 17:18:24 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553580) was formed. Members
>> Oct 8 17:18:24 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
>> Oct 8 17:18:24 42-hw-rig4-L3-2 corosync[15105]: [MAIN ] Completed service synchronization, ready to provide service.
>> Oct 8 17:18:26 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553584) was formed. Members
>> Oct 8 17:18:26 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
>> Oct 8 17:18:26 42-hw-rig4-L3-2 corosync[15105]: [MAIN ] Completed service synchronization, ready to provide service.
>>
>> The cluster seems to be generally happy:
>>
>> [root@42-hw-rig4-L3-2 ~]# pcs cluster status
>> Cluster Status:
>>  Last updated: Thu Oct 8 17:24:02 2015
>>  Last change: Thu Oct 8 16:46:57 2015
>>  Stack: corosync
>>  Current DC: dq-ceph9.clearsky-data.net (3) - partition with quorum
>>  Version: 1.1.12-a14efad
>>  3 Nodes configured
>>  17 Resources configured
>>
>> PCSD Status:
>>   42-hw-back-1.clearsky-data.net: Online
>>   41-hw-back-1.clearsky-data.net: Online
>>   dq-ceph9.clearsky-data.net: Online
>>
>> The corosync config is:
>>
>> totem {
>>     version: 2
>>     secauth: off
>>     cluster_name: L3_cluster
>>     transport: udpu
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: 42-hw-back-1.clearsky-data.net
>>         nodeid: 1
>>     }
>>     node {
>>         ring0_addr: 41-hw-back-1.clearsky-data.net
>>         nodeid: 2
>>     }
>>     node {
>>         ring0_addr: dq-ceph9.clearsky-data.net
>>         nodeid: 3
>>     }
>> }
>>
>> quorum {
>>     provider: corosync_votequorum
>> }
>>
>> logging {
>>     to_syslog: yes
>>     debug: off
>> }
>>
>> What do these messages mean, and how can we stop them?
>>
>> Any help would be very appreciated.
>>
>> Ilia Sokolinski
>>
>> PS
>>
>> I have tried to enable corosync debug and got the following logs:
>>
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [CFG ] Config reload requested by node 1
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB ] HUP conn (15105-46913-25)
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB ] qb_ipcs_disconnect(15105-46913-25) state:2
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB ] epoll_ctl(del): Bad file descriptor (9)
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [MAIN ] cs_ipcs_connection_closed()
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [MAIN ] cs_ipcs_connection_destroyed()
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-15105-46913-25-header
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cfg-event-15105-46913-25-header
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-15105-46913-25-header
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 3
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 3 flags: 1
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 3
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 2
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 3 flags: 1
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 2
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 1
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 3 flags: 1
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] total_votes=3, expected_votes=3
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] node 1 state=1, votes=1, expected=3
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] node 2 state=1, votes=1, expected=3
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] node 3 state=1, votes=1, expected=3
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] lowest node id: 1 us: 1
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] highest node id: 3 us: 1
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 1
>> Oct 8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
>> Oct 8 16:18:48 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] entering GATHER state from 9 (merge during operational state).
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] IPC credentials authenticated (15105-46923-25)
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] connecting to client [46923]
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [MAIN ] connection created
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [CMAP ] lib_init_fn: conn=0x560430414560
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] HUP conn (15105-46923-25)
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] qb_ipcs_disconnect(15105-46923-25) state:2
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] epoll_ctl(del): Bad file descriptor (9)
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [MAIN ] cs_ipcs_connection_closed()
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [CMAP ] exit_fn for conn=0x560430414560
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [MAIN ] cs_ipcs_connection_destroyed()
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-15105-46923-25-header
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-15105-46923-25-header
>> Oct 8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-15105-46923-25-header
>
> The line "entering GATHER state from 9 (merge during operational state)" is
> the interesting one. I believe there is a different cluster on the network
> that still has the IP address of one of this cluster's nodes in its
> configuration. As a simple test, you can try to add
>
> interface {
>     ringnumber: 0
>     mcastport: 5409
> }
>
> to the totem section of corosync.conf on all nodes of one of the affected
> clusters and see if the problem disappears.
>
> Honza
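For reference, the totem section quoted above with that interface block merged in would look roughly like this; 5409 is only the example value from Honza's suggestion, and any UDP port not already used by another corosync instance on the network should work (with transport: udpu the mcastport setting is still the UDP port corosync communicates on, so stray nodes left on the default port 5405 presumably can no longer merge with this cluster):

totem {
    version: 2
    secauth: off
    cluster_name: L3_cluster
    transport: udpu

    interface {
        ringnumber: 0
        mcastport: 5409
    }
}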
Thank you very much! I found a couple of old nodes that were still configured to be members of this cluster and destroyed the pcs configuration on them (pcs cluster destroy --force). The log messages stopped immediately.

Ilia

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
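A minimal sketch of the cleanup Ilia describes, assuming the stale hosts are still reachable over SSH; the hostname old-node-1 is hypothetical, the path is the CentOS 7 default, and pcs cluster destroy --force is the command from the reply above:

    # on each suspected stale host, check whether it still claims membership
    ssh old-node-1 'grep cluster_name /etc/corosync/corosync.conf'   # still says L3_cluster?
    ssh old-node-1 'systemctl is-active corosync pacemaker'          # is the stack still running?

    # if so, remove the stale cluster configuration from that host
    ssh old-node-1 'pcs cluster destroy --force'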