Thanks for the confirmation, Jan, but this sounds a bit scary to me! Spinning this experiment a bit further ...
Would this not also mean that with passive RRP and 2 rings it only takes 2
different nodes, each unable to communicate on a different ring at the same
time, to have all rings marked faulty on _every_ node, and therefore all
cluster members losing quorum immediately, even though n-2 cluster members
are technically still able to send and receive heartbeat messages through
both rings? I really hope the answer is no and the cluster still somehow
keeps quorum in this case.

A sketch of the configuration I am testing, and an example of how to block a
NIC for such a test, follow below the quoted thread.

Regards,
Martin Schlegel

> Jan Friesse <jfrie...@redhat.com> wrote on 5 October 2016 at 09:01:
>
> Martin,
>
> > Hello all,
> >
> > I am trying to understand the following 2 Corosync heartbeat ring
> > failure scenarios I have been testing, and I hope somebody can explain
> > why the observed behaviour makes sense.
> >
> > Consider the following cluster:
> >
> > * 3x nodes: A, B and C
> > * 2x NICs for each node
> > * Corosync 2.3.5 configured with "rrp_mode: passive" and
> >   udpu transport with ring id 0 and 1 on each node.
> > * On each node "corosync-cfgtool -s" shows:
> >   [...] ring 0 active with no faults
> >   [...] ring 1 active with no faults
> >
> > Consider the following scenarios:
> >
> > 1. On node A only, block all communication on the first NIC, which is
> >    configured with ring id 0
> > 2. On node A only, block all communication on all NICs, which are
> >    configured with ring id 0 and 1
> >
> > The result of the above scenarios is as follows:
> >
> > 1. Nodes A, B and C (!) display the following ring status:
> >    [...] Marking ringid 0 interface <IP-Address> FAULTY
> >    [...] ring 1 active with no faults
> > 2. Node A is shown as OFFLINE; B and C display the following ring status:
> >    [...] ring 0 active with no faults
> >    [...] ring 1 active with no faults
> >
> > Questions:
> > 1. Is this the expected outcome?
>
> Yes
>
> > 2. In experiment 1, B and C can still communicate with each other over
> > both NICs, so why are B and C not displaying a "no faults" status for
> > ring id 0 and 1, just like in experiment 2,
>
> Because this is how RRP works. RRP marks the whole ring as failed, so
> every node sees that ring as failed.
>
> > when node A is completely unreachable?
>
> Because it's a different scenario. In scenario 1 there is a 3-node
> membership where one node has one failed ring -> the whole ring is marked
> failed. In scenario 2 there is a 2-node membership where both rings work
> as expected. Node A is completely unreachable and is not in the
> membership.
>
> Regards,
> Honza
>
> > Regards,
> > Martin Schlegel
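For anyone who wants to reproduce this, here is a minimal corosync.conf
sketch of the kind of setup I am testing. The cluster name and all addresses
are placeholders, not my actual networks:

    totem {
        version: 2
        cluster_name: testcluster
        transport: udpu
        rrp_mode: passive

        interface {
            ringnumber: 0
            bindnetaddr: 192.168.0.0
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.1.0
            mcastport: 5405
        }
    }

    nodelist {
        node {
            nodeid: 1
            ring0_addr: 192.168.0.1
            ring1_addr: 192.168.1.1
        }
        node {
            nodeid: 2
            ring0_addr: 192.168.0.2
            ring1_addr: 192.168.1.2
        }
        node {
            nodeid: 3
            ring0_addr: 192.168.0.3
            ring1_addr: 192.168.1.3
        }
    }

    quorum {
        provider: corosync_votequorum
    }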
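Blocking a single NIC for a test like this can be done with iptables, for
example (eth0 is a placeholder for whatever interface ring 0 is bound to on
node A):

    # scenario 1: block only the NIC carrying ring 0 on node A
    iptables -A INPUT  -i eth0 -j DROP
    iptables -A OUTPUT -o eth0 -j DROP

    # observe the ring status on any node
    corosync-cfgtool -s

    # undo the block
    iptables -D INPUT  -i eth0 -j DROP
    iptables -D OUTPUT -o eth0 -j DROP

If the ring stays marked FAULTY after unblocking, "corosync-cfgtool -r"
resets the redundant ring state cluster-wide.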