Martin Schlegel wrote:
Thanks for the confirmation, Jan, but this sounds a bit scary to me!

Spinning this experiment a bit further ...

Would this not also mean that, with a passive RRP configuration with 2 rings, it only takes 2
different nodes that are unable to communicate on different networks at the
same time for all rings to be marked faulty on _every_ node ... and therefore all
cluster members would lose quorum immediately, even though n-2 cluster members are
technically able to send and receive heartbeat messages through both rings?

Not exactly, but this situation causes corosync to behave really badly, spending most of its time in a "creating new membership" loop.

Yes, RRP is simply bad. If you can, use bonding instead. Replacing RRP with knet is the biggest TODO for 3.x.
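
For illustration, one way to set up an active-backup bond with iproute2 looks roughly like this; the interface names eth0/eth1 and the address are placeholder assumptions, not values from this thread:

    # Assumption: eth0 and eth1 are the two NICs previously used for ring 0 and ring 1.
    ip link add bond0 type bond mode active-backup miimon 100
    ip link set eth0 down
    ip link set eth0 master bond0
    ip link set eth1 down
    ip link set eth1 master bond0
    ip link set bond0 up
    ip addr add 192.168.10.11/24 dev bond0   # example address; corosync then binds to this single network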

Regards,
  Honza


I really hope the answer is no and that the cluster somehow still retains quorum in
this case.

Regards,
Martin Schlegel


Jan Friesse <jfrie...@redhat.com> wrote on 5 October 2016 at 09:01:

Martin,

Hello all,

I am trying to understand the following 2 Corosync heartbeat ring failure scenarios
I have been testing, and I hope somebody can explain why the observed behaviour makes sense.

Consider the following cluster:

  * 3x Nodes: A, B and C
  * 2x NICs for each Node
  * Corosync 2.3.5 configured with "rrp_mode: passive" and
  udpu transport, with ring ids 0 and 1 on each node (see the configuration sketch after this list).
  * On each node "corosync-cfgtool -s" shows:
  [...] ring 0 active with no faults
  [...] ring 1 active with no faults
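
A minimal corosync.conf sketch matching the setup described above, for illustration only; the subnets and node addresses are assumptions, not taken from the original report:

    totem {
        version: 2
        transport: udpu
        rrp_mode: passive
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.10.0    # assumed ring-0 subnet
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.20.0    # assumed ring-1 subnet
            mcastport: 5405
        }
    }
    nodelist {
        node {
            nodeid: 1
            ring0_addr: 192.168.10.11    # node A (assumed addresses)
            ring1_addr: 192.168.20.11
        }
        node {
            nodeid: 2
            ring0_addr: 192.168.10.12    # node B
            ring1_addr: 192.168.20.12
        }
        node {
            nodeid: 3
            ring0_addr: 192.168.10.13    # node C
            ring1_addr: 192.168.20.13
        }
    }
    quorum {
        provider: corosync_votequorum
    }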

Consider the following scenarios:

  1. On node A only, block all communication on the first NIC, which is configured with
ring id 0 (for example with iptables, as sketched after this list)
  2. On node A only, block all communication on both NICs, which are configured with
ring ids 0 and 1
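
One way to simulate the blocking, assuming eth0 carries ring 0 and eth1 carries ring 1 (the interface names are assumptions):

    # Scenario 1: block only the ring-0 NIC on node A
    iptables -A INPUT  -i eth0 -j DROP
    iptables -A OUTPUT -o eth0 -j DROP

    # Scenario 2: additionally block the ring-1 NIC on node A
    iptables -A INPUT  -i eth1 -j DROP
    iptables -A OUTPUT -o eth1 -j DROP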

The results of the above scenarios are as follows:

  1. Nodes A, B and C (!) display the following ring status:
  [...] Marking ringid 0 interface <IP-Address> FAULTY
  [...] ring 1 active with no faults
  2. Node A is shown as OFFLINE - B and C display the following ring status:
  [...] ring 0 active with no faults
  [...] ring 1 active with no faults
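
As a side note, after connectivity on the blocked NIC is restored the FAULTY marking may not clear on its own; the redundant ring state can be reset cluster-wide with corosync-cfgtool:

    # re-enable redundant ring operation after the underlying fault is fixed
    corosync-cfgtool -r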

Questions:
  1. Is this the expected outcome?

Yes

2. In experiment 1, B and C can still communicate with each other over both
NICs, so why are
  B and C not displaying a "no faults" status for ring ids 0 and 1, just like
in experiment 2,

Because this is how RRP works. RRP marks the whole ring as failed, so every
node sees that ring as failed.

when node A is completely unreachable?

Because it's a different scenario. In scenario 1 there is a 3-node
membership in which one node has a failed ring -> the whole ring is
marked failed. In scenario 2 there is a 2-node membership in which both rings
work as expected; node A is completely unreachable and is not in the
membership.
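
For completeness, assuming the default votequorum settings (one vote per node, no two_node option), the quorum arithmetic for scenario 2 works out as:

    expected_votes      = 3
    quorum              = floor(3 / 2) + 1 = 2
    votes(B) + votes(C) = 2  >=  quorum

so the remaining two-node partition keeps quorum even though node A is unreachable.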

Regards,
  Honza

Regards,
Martin Schlegel

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
