Thanks for all the responses from Jan, Ulrich and Digimer! We are already using
bonded network interfaces, but we are also forced to go across IP subnets, and
certain routes between routers can go missing - and indeed have gone missing.
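For reference, the style of configuration this is all about is the classic
two-ring passive RRP over udpu, roughly like the corosync.conf sketch below.
All addresses, ports and node ids here are placeholders, not our real values:

  totem {
      version: 2
      transport: udpu
      rrp_mode: passive

      # ring 0 - placeholder subnet for the first (e.g. public) network
      interface {
          ringnumber: 0
          bindnetaddr: 10.0.0.0
          mcastport: 5405
      }

      # ring 1 - placeholder subnet for the second (e.g. private) network
      interface {
          ringnumber: 1
          bindnetaddr: 192.168.0.0
          mcastport: 5407
      }
  }

  nodelist {
      # one entry per node, with one address per ring (placeholders)
      node {
          nodeid: 1
          ring0_addr: 10.0.0.1
          ring1_addr: 192.168.0.1
      }
      # ... nodes B and C accordingly
  }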
Exactly this has happened to one of our nodes' public networks, which became
unreachable from other local public IP subnets. If the same were to happen in
parallel on another node of our private network, the entire cluster would be
down, simply because - as Ulrich said, "It's a ring!" - both heartbeat rings
get marked faulty. That is not an optimal result, because cluster
communication is in fact still 100% possible between all nodes. With an
increasing number of nodes this risk becomes fairly big - just think of
providers of bigger cloud infrastructures.

With the above scenario in mind - is there a better (tested and recommended)
way to configure this? Or is knet the way to go in the future?

Regards,
Martin Schlegel

> Jan Friesse <jfrie...@redhat.com> wrote on 7 October 2016 at 11:28:
>
> Martin Schlegel wrote:
> > Thanks for the confirmation Jan, but this sounds a bit scary to me!
> >
> > Spinning this experiment a bit further ...
> >
> > Would this not also mean that with a passive RRP with 2 rings it only
> > takes 2 different nodes that are unable to communicate on different
> > networks at the same time to have all rings marked faulty on _every_
> > node, and therefore all cluster members losing quorum immediately, even
> > though n-2 cluster members are technically able to send and receive
> > heartbeat messages through both rings?
>
> Not exactly, but this situation causes corosync to start behaving really
> badly, spending most of its time in a "creating new membership" loop.
>
> Yes, RRP is simply bad. If you can, use bonding. Improving RRP by replacing
> it with knet is the biggest TODO for 3.x.
>
> Regards,
> Honza
>
> > I really hope the answer is no and the cluster still somehow has a quorum
> > in this case.
> >
> > Regards,
> > Martin Schlegel
> >
> >> Jan Friesse <jfrie...@redhat.com> wrote on 5 October 2016 at 09:01:
> >>
> >> Martin,
> >>
> >>> Hello all,
> >>>
> >>> I am trying to understand the following 2 Corosync heartbeat ring
> >>> failure scenarios I have been testing, and I hope somebody can explain
> >>> why the results make sense.
> >>>
> >>> Consider the following cluster:
> >>>
> >>> * 3x nodes: A, B and C
> >>> * 2x NICs for each node
> >>> * Corosync 2.3.5 configured with "rrp_mode: passive" and
> >>>   udpu transport with ring id 0 and 1 on each node.
> >>> * On each node "corosync-cfgtool -s" shows:
> >>>   [...] ring 0 active with no faults
> >>>   [...] ring 1 active with no faults
> >>>
> >>> Consider the following scenarios:
> >>>
> >>> 1. On node A only, block all communication on the first NIC, configured
> >>>    with ring id 0.
> >>> 2. On node A only, block all communication on all NICs, configured with
> >>>    ring id 0 and 1.
> >>>
> >>> The result of the above scenarios is as follows:
> >>>
> >>> 1. Nodes A, B and C (!) display the following ring status:
> >>>    [...] Marking ringid 0 interface <IP-Address> FAULTY
> >>>    [...] ring 1 active with no faults
> >>> 2. Node A is shown as OFFLINE - B and C display the following ring
> >>>    status:
> >>>    [...] ring 0 active with no faults
> >>>    [...] ring 1 active with no faults
> >>>
> >>> Questions:
> >>> 1. Is this the expected outcome?
> >>
> >> Yes
> >>
> >>> 2. In experiment 1, B and C can still communicate with each other over
> >>>    both NICs, so why are B and C not displaying a "no faults" status
> >>>    for ring id 0 and 1, just like in experiment 2,
> >>
> >> Because this is how RRP works. RRP marks the whole ring as failed, so
> >> every node sees that ring as failed.
> >>
> >>> when node A is completely unreachable?
> >>
> >> Because it's a different scenario. In scenario 1 there is a 3-node
> >> membership in which one node has one failed ring -> the whole ring is
> >> marked failed. In scenario 2 there is a 2-node membership in which both
> >> rings work as expected. Node A is completely unreachable and it's not in
> >> the membership.
> >>
> >> Regards,
> >> Honza
> >>
> >>> Regards,
> >>> Martin Schlegel

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org