Sven,

Hi,
we had been running a corosync config with two rings for about 2.5 years on a two-node NFS cluster (active/passive). The first ring (ring 0) is configured on a dedicated NIC for cluster-internal communication. The second ring (ring 1) was configured on the interface the NFS service runs on. I thought corosync would primarily use ring 0 and only fall back to ring 1. But I was totally

Corosync 2.x in RRP mode always uses all healthy rings. The difference between active and passive is that active sends every message via both rings, while passive sends packets in a round-robin fashion across the rings.
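
(To make that concrete: the mode is chosen with the rrp_mode option in the totem section of corosync.conf; only this fragment matters here, the full config is quoted at the end of this mail and uses passive.)

totem {
        # "passive": send packets round-robin over the healthy rings
        # "active":  send every packet on all rings at once
        rrp_mode: passive
}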

wrong. Since this cluster has been in production, we have seen these messages all the time:

[TOTEM ] Marking ringid 1 interface 10.7.2.101 FAULTY
[TOTEM ] Automatically recovered ring 1
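
(As an aside, the per-ring state behind these messages can be inspected, and a ring that stays marked faulty can be reset by hand, with corosync-cfgtool; these are standard corosync 2.x invocations, not taken from the logs above:)

# print the status of the configured rings as seen by this node
corosync-cfgtool -s

# reset redundant ring state cluster-wide after a fault
corosync-cfgtool -r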

But everything seemed to be OK, until last Friday. The first symptom was a dying
MySQL replication caused by massive network load. Some time later, our web
servers started failing: they ran into socket timeouts because they could no
longer reach their NFS shares (on the cluster mentioned above). We saw massive
packet loss of between 10% and 90% within our LAN, especially on the NFS
cluster nodes. We searched for several possible causes: network loops, dying
switches and other things. But in the end we changed the corosync config to use
only one ring (ring 0) and restarted the whole NFS cluster. It took some
minutes, but then the network issues were gone. No packet loss on the NFS
cluster anymore.
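
(Roughly, a single-ring version of the totem section quoted at the end of this mail looks like this, i.e. the second interface block and rrp_mode dropped:)

totem {
        version: 2
        cluster_name: nfs-cluster

        crypto_cipher: aes256
        crypto_hash: sha1

        clear_node_high_bit: yes

        interface {
                ringnumber: 0
                bindnetaddr: 10.7.0.0
                mcastaddr: 239.255.54.33
                mcastport: 5417
                ttl: 2
        }
}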

So my guess is that corosync's multicast traffic killed our switches, so they began to drop

If a user-space application (corosync) really can kill a switch (HW), then I would consider throwing away such a switch.

network packets because of the overload caused by corosync. Is an issue like that already known? Does anyone have a clue what was going on?

I think it's an ifdown problem. If you shut down all the nodes and start them again with two rings, does everything work?

Anyway: RRP is broken, which is why it was completely replaced by knet in 3.x. So if you want to stay with 2.x (even though on Leap 42.3 you may consider updating to 3.x), I would recommend staying with a single ring and using bonding/teaming if possible.
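
(A quick hand-made sketch of the bonding idea, with hypothetical NIC names eth0/eth1, an assumed active-backup mode and a guessed node address on the ring-0 subnet; on openSUSE you would normally make this persistent via YaST/wicked ifcfg files rather than running it by hand:)

# create an active-backup bond and enslave the two NICs (names assumed)
ip link add bond0 type bond mode active-backup miimon 100
ip link set eth0 down
ip link set eth0 master bond0
ip link set eth1 down
ip link set eth1 master bond0
ip addr add 10.7.0.101/24 dev bond0
ip link set bond0 up

# corosync then keeps a single ring bound to the bond's network,
# e.g. bindnetaddr: 10.7.0.0 as in ring 0 of the config below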

Honza


Here are the system specs/configs
OS: openSUSE Leap 42.3
Kernel: 4.4.126-48-default

Installed Cluster packages:
rpm -qa | grep -Ei "(corosync|pace|cluster)"
libpacemaker3-1.1.16-3.6.x86_64
pacemaker-cli-1.1.16-3.6.x86_64
pacemaker-cts-1.1.16-3.6.x86_64
libcorosync4-2.3.6-7.1.x86_64
pacemaker-1.1.16-3.6.x86_64
pacemaker-remote-1.1.16-3.6.x86_64

Corosync config in use at the time of the incident:
totem {
        version: 2
        cluster_name: nfs-cluster

        crypto_cipher: aes256
        crypto_hash: sha1

        clear_node_high_bit: yes


        rrp_mode: passive

        interface {
                ringnumber: 0
                bindnetaddr: 10.7.0.0
                mcastaddr: 239.255.54.33
                mcastport: 5417
                ttl: 2
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.7.2.0
                mcastaddr: 239.255.54.35
                mcastport: 5417
                ttl: 2
        }
}

logging {
        fileline: off
        to_stderr: no
        to_syslog: yes
        debug: off
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
        wait_for_all: 0
}
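
(Side note: with two_node: 1 / wait_for_all: 0 the effective quorum and membership state can be checked at runtime with a standard corosync 2.x tool:)

# show votequorum status and current membership for this cluster
corosync-quorumtool -s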


kind regards
Sven
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

