Hi all,
I have a problem where, when I create a network split/partition (by dropping
traffic with iptables), the victim node does not realise it has been split off
from the rest of the cluster. If I split a cluster into two partitions that
both contain multiple nodes, one with quorum and one without, then things work
as expected; it just appears that a single node on its own cannot work out that
it has lost quorum when it has no other nodes to talk to.
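For reference, the split is created by dropping everything in and out of the
victim's cluster interface, i.e. something to this effect (simplified here):
# iptables -A INPUT -i eth0 -j DROP
# iptables -A OUTPUT -o eth0 -j DROP
and the split is healed again later with a flush:
# iptables -F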
The isolated victim node does seem to recognise that it can't form a cluster
due to network issues, but this is not reflected in the output of
corosync-quorumtool, and cluster services (via Pacemaker) continue to run.
The nodes in the rest of the cluster, by contrast, do realise they have lost
contact with a node, lose quorum and correctly shut down services.
When I block traffic on the victim node's eth0, the remaining nodes see that
they cannot communicate with it and shut down:
# corosync-quorumtool -s
Version: 1.4.5
Nodes: 3
Ring ID: 696
Quorum type: corosync_votequorum
Quorate: No
Node votes: 1
Expected votes: 7
Highest expected: 7
Total votes: 3
Quorum: 4 Activity blocked
Flags:
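(That is consistent with the votequorum arithmetic: with expected_votes set to
7, the quorum threshold is floor(7/2) + 1 = 4, so this 3-vote partition is
rightly blocked, and a lone node with a single vote should be blocked too.)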
However, the victim node still thinks everything is fine, and maintains its
view of the cluster from before the split:
# corosync-quorumtool -s
Version: 1.4.5
Nodes: 4
Ring ID: 716
Quorum type: corosync_votequorum
Quorate: Yes
Node votes: 1
Expected votes: 7
Highest expected: 7
Total votes: 4
Quorum: 4
Flags: Quorate
It does, however, notice in the logs that it can no longer form a cluster, as
the following message repeats constantly:
corosync [MAIN ] Totem is unable to form a cluster because of an operating
system or network fault. The most common cause of this message is that the
local firewall is configured improperly.
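An easy way to watch the discrepancy is to leave a simple poll running on the
victim while those messages are being logged, e.g.:
# while true; do date; corosync-quorumtool -s; sleep 5; done
The reported state stays exactly as shown above (Quorate: Yes, 4 nodes) the
whole time traffic is blocked.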
At this point I would expect it to consider itself a partition of its own with
a total of 1 vote, and to block activity. However, this does not seem to happen
until just after it rejoins the cluster. When I unblock traffic and it rejoins,
I see the victim finally realise that it had lost quorum:
Sep 05 09:52:21 corosync [pcmk ] notice: pcmk_peer_update: Transitional
membership event on ring 720: memb=1, new=0, lost=3
Sep 05 09:52:21 corosync [VOTEQ ] quorum lost, blocking activity
Sep 05 09:52:21 corosync [QUORUM] This node is within the non-primary component
and will NOT provide any services.
Sep 05 09:52:21 corosync [QUORUM] Members[1]: 358898186
A second or so later it regains quorum:
crmd: notice: ais_dispatch_message: Membership 736: quorum acquired
So my question is: why, when the victim realises it cannot form a cluster
("Totem is unable to form..."), does it not lose quorum, update the status
reported by corosync-quorumtool, and shut down cluster services?
A configuration file example and package versions/environment are listed below.
I'm using the udpu transport as we need to avoid multicast in this environment;
it will eventually run over a routed network. The behaviour also persists when
I disable the Pacemaker plugin and test with corosync alone.
compatibility: whitetank

totem {
    version: 2
    secauth: off
    interface {
        member {
            memberaddr: 10.90.100.20
        }
        member {
            memberaddr: 10.90.100.21
        }
        ...
        ... more nodes snipped
        ...
        ringnumber: 0
        bindnetaddr: 10.90.100.20
        mcastport: 5405
    }
    transport: udpu
}

amf {
    mode: disabled
}

aisexec {
    user: root
    group: root
}

quorum {
    provider: corosync_votequorum
    expected_votes: 7
}

service {
    # Load the Pacemaker Cluster Resource Manager
    name: pacemaker
    ver: 0
}
Environment: CentOS 6.4
Packages from openSUSE:
http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/x86_64/
# rpm -qa | egrep "^(cluster|corosync|crm|libqb|pacemaker|resource-agents)" | sort
cluster-glue-1.0.11-3.1.x86_64
cluster-glue-libs-1.0.11-3.1.x86_64
corosync-1.4.5-2.2.x86_64
corosynclib-1.4.5-2.2.x86_64
crmsh-1.2.6-0.rc3.3.1.x86_64
libqb0-0.14.4-1.2.x86_64
pacemaker-1.1.9-2.1.x86_64
pacemaker-cli-1.1.9-2.1.x86_64
pacemaker-cluster-libs-1.1.9-2.1.x86_64
pacemaker-libs-1.1.9-2.1.x86_64
resource-agents-3.9.5-3.1.x86_64
Regards,
-Mark