Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)
Thanks for all responses from Jan, Ulrich and Digimer! We are already using
bonded network interfaces, but we are also forced to go across IP subnets.
Certain routes between routers can go, and have gone, missing. This has
happened for one of our nodes' public network, where it was inaccessible to
other local, public IP subnets. If this were to happen in parallel on another
node of our private network, the entire cluster would be down, just because,
as Ulrich said, "It's a ring!", both heartbeat rings are marked faulty. That is
not an optimal result, because cluster communication is in fact 100% possible
between all nodes. With an increasing number of nodes this risk is fairly big.
Just think about providers of bigger cloud infrastructures.

With the above scenario in mind: is there a better (tested and recommended)
way to configure this? Or is knet the way to go in the future?

Regards,
Martin Schlegel

> Jan Friesse wrote on 7 October 2016 at 11:28:
>
> Martin Schlegel wrote:
>
> > Thanks for the confirmation Jan, but this sounds a bit scary to me!
> >
> > Spinning this experiment a bit further ...
> >
> > Would this not also mean that with a passive RRP with 2 rings it only
> > takes 2 different nodes that are unable to communicate on different
> > networks at the same time to have all rings marked faulty on _every_
> > node, and therefore all cluster members losing quorum immediately, even
> > though n-2 cluster members are technically able to send and receive
> > heartbeat messages through both rings?
>
> Not exactly, but this situation causes corosync to start behaving really
> badly, spending most of its time in a "creating new membership" loop.
>
> Yes, RRP is simply bad. If you can, use bonding. Replacing RRP with knet
> is the biggest TODO for 3.x.
>
> Regards,
> Honza
>
> > I really hope the answer is no and the cluster still somehow has quorum
> > in this case.
> >
> > Regards,
> > Martin Schlegel
> >
> >> Jan Friesse wrote on 5 October 2016 at 09:01:
> >>
> >> Martin,
> >>
> >>> Hello all,
> >>>
> >>> I am trying to understand the following 2 Corosync heartbeat ring
> >>> failure scenarios I have been testing, and hope somebody can explain
> >>> why the outcome makes sense.
> >>>
> >>> Consider the following cluster:
> >>>
> >>> * 3x nodes: A, B and C
> >>> * 2x NICs for each node
> >>> * Corosync 2.3.5 configured with "rrp_mode: passive" and udpu
> >>>   transport, with ring id 0 and 1 on each node.
> >>> * On each node "corosync-cfgtool -s" shows:
> >>>   [...] ring 0 active with no faults
> >>>   [...] ring 1 active with no faults
> >>>
> >>> Consider the following scenarios:
> >>>
> >>> 1. On node A only, block all communication on the first NIC,
> >>>    configured with ring id 0.
> >>> 2. On node A only, block all communication on all NICs, configured
> >>>    with ring id 0 and 1.
> >>>
> >>> The result of the above scenarios is as follows:
> >>>
> >>> 1. Nodes A, B and C (!) display the following ring status:
> >>>    [...] Marking ringid 0 interface FAULTY
> >>>    [...] ring 1 active with no faults
> >>> 2. Node A is shown as OFFLINE; B and C display the following ring
> >>>    status:
> >>>    [...] ring 0 active with no faults
> >>>    [...] ring 1 active with no faults
> >>>
> >>> Questions:
> >>> 1. Is this the expected outcome?
> >>
> >> Yes
> >>
> >>> 2. In experiment 1, B and C can still communicate with each other
> >>>    over both NICs, so why are B and C not displaying a "no faults"
> >>>    status for ring id 0 and 1 just like in experiment 2,
> >>
> >> Because this is how RRP works. RRP marks the whole ring as failed, so
> >> every node sees that ring as failed.
> >>
> >>> when node A is completely unreachable?
> >>
> >> Because it's a different scenario. In scenario 1 there is a 3-node
> >> membership where one node has one failed ring -> the whole ring is
> >> failed. In scenario 2 there is a 2-node membership where both rings
> >> work as expected. Node A is completely unreachable and it's not in
> >> the membership.
> >>
> >> Regards,
> >> Honza
> >>
> >>> Regards,
> >>> Martin Schlegel
> >>>
> >>> ___
> >>> Users mailing list: Users@clusterlabs.org
> >>> http://clusterlabs.org/mailman/listinfo/users
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
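The thread never shows the actual corosync.conf. For orientation, a minimal sketch of the kind of two-ring udpu configuration described above might look like this; all subnet addresses, node addresses, node IDs and ports below are made-up placeholders, not values from the original cluster:

```
totem {
    version: 2
    transport: udpu
    rrp_mode: passive

    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0      # subnet of the first NIC (placeholder)
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.1.0   # subnet of the second NIC (placeholder)
        mcastport: 5407
    }
}

nodelist {
    node {
        ring0_addr: 10.0.0.1       # node A, first NIC (placeholder)
        ring1_addr: 192.168.1.1    # node A, second NIC (placeholder)
        nodeid: 1
    }
    # ... analogous node {} entries for B and C
}
```

With udpu, each node needs a ring0_addr (and ring1_addr for the second ring) in the nodelist; "corosync-cfgtool -s" then reports the per-ring status quoted throughout this thread.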
Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)
Martin Schlegel wrote:

> Thanks for the confirmation Jan, but this sounds a bit scary to me!
>
> Spinning this experiment a bit further ...
>
> Would this not also mean that with a passive RRP with 2 rings it only
> takes 2 different nodes that are unable to communicate on different
> networks at the same time to have all rings marked faulty on _every_
> node, and therefore all cluster members losing quorum immediately, even
> though n-2 cluster members are technically able to send and receive
> heartbeat messages through both rings?

Not exactly, but this situation causes corosync to start behaving really
badly, spending most of its time in a "creating new membership" loop.

Yes, RRP is simply bad. If you can, use bonding. Replacing RRP with knet is
the biggest TODO for 3.x.

Regards,
Honza

> I really hope the answer is no and the cluster still somehow has quorum
> in this case.
>
> Regards,
> Martin Schlegel
>
> > Jan Friesse wrote on 5 October 2016 at 09:01:
> >
> > Martin,
> >
> > > Hello all,
> > >
> > > I am trying to understand the following 2 Corosync heartbeat ring
> > > failure scenarios I have been testing, and hope somebody can explain
> > > why the outcome makes sense.
> > >
> > > Consider the following cluster:
> > >
> > > * 3x nodes: A, B and C
> > > * 2x NICs for each node
> > > * Corosync 2.3.5 configured with "rrp_mode: passive" and udpu
> > >   transport, with ring id 0 and 1 on each node.
> > > * On each node "corosync-cfgtool -s" shows:
> > >   [...] ring 0 active with no faults
> > >   [...] ring 1 active with no faults
> > >
> > > Consider the following scenarios:
> > >
> > > 1. On node A only, block all communication on the first NIC,
> > >    configured with ring id 0.
> > > 2. On node A only, block all communication on all NICs, configured
> > >    with ring id 0 and 1.
> > >
> > > The result of the above scenarios is as follows:
> > >
> > > 1. Nodes A, B and C (!) display the following ring status:
> > >    [...] Marking ringid 0 interface FAULTY
> > >    [...] ring 1 active with no faults
> > > 2. Node A is shown as OFFLINE; B and C display the following ring
> > >    status:
> > >    [...] ring 0 active with no faults
> > >    [...] ring 1 active with no faults
> > >
> > > Questions:
> > > 1. Is this the expected outcome?
> >
> > Yes
> >
> > > 2. In experiment 1, B and C can still communicate with each other
> > >    over both NICs, so why are B and C not displaying a "no faults"
> > >    status for ring id 0 and 1 just like in experiment 2,
> >
> > Because this is how RRP works. RRP marks the whole ring as failed, so
> > every node sees that ring as failed.
> >
> > > when node A is completely unreachable?
> >
> > Because it's a different scenario. In scenario 1 there is a 3-node
> > membership where one node has one failed ring -> the whole ring is
> > failed. In scenario 2 there is a 2-node membership where both rings
> > work as expected. Node A is completely unreachable and it's not in
> > the membership.
> >
> > Regards,
> > Honza
> >
> > > Regards,
> > > Martin Schlegel
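The quorum arithmetic behind Martin's worry can be sketched in a few lines. This is a plain majority-vote model for illustration only, not corosync's votequorum implementation (which adds options such as expected_votes and two_node):

```python
def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """Simple majority quorum: strictly more than half of all votes."""
    return reachable_nodes > total_nodes // 2

# 3-node cluster: membership of 2 keeps quorum (scenario 2 above),
# a node left alone does not.
assert has_quorum(3, 3)
assert has_quorum(3, 2)
assert not has_quorum(3, 1)
```

In scenario 2 the B/C partition is a 2-of-3 majority, so it keeps quorum; the concern in this thread is that faulty-ring handling, not quorum math, is what degrades when two nodes fail on different rings.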
Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)
PS. In security, handling everything at one (high) level is known as "hard
crunchy shell with soft chewy center". It's not seen as a good thing.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)
On 10/06/2016 11:25 AM, Klaus Wenninger wrote:
> But it is convenient because all layers on top can be completely
> agnostic of the duplicity.

It's also cheap: failing over a node, especially when taking over involves
replaying a database log, or even just re-establishing a bunch of NFS
connections, is far more disruptive than mdadm going into "degraded" state
and sending you an e-mail.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)
On 10/06/2016 09:26 AM, Klaus Wenninger wrote:
> Usually one - at least me so far - would rather think that having
> the awareness of redundancy/cluster as high up as possible in the
> protocol/application-stack would open up possibilities for more
> appropriate reactions.

The obvious counter-example is a hard disk failure: failures are common on
commodity spinning-rust drives, and they are cheap and easy to handle at a
lower level by throwing a second disk into an mdadm RAID-1.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
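The mdadm setup alluded to above is the standard two-disk mirror plus monitoring. A sketch of the usual commands, with all device names and the alert address as placeholders (the message does not specify any of them):

```
# Create a two-disk RAID-1 mirror; /dev/sda1 and /dev/sdb1 are placeholders.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# Have mdadm watch all arrays and e-mail when one goes degraded,
# which is the "sends you an e-mail" behaviour mentioned above.
mdadm --monitor --scan --mail=admin@example.org --daemonise
```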
Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)
Thanks for the confirmation Jan, but this sounds a bit scary to me!

Spinning this experiment a bit further ...

Would this not also mean that with a passive RRP with 2 rings it only takes
2 different nodes that are unable to communicate on different networks at
the same time to have all rings marked faulty on _every_ node, and therefore
all cluster members losing quorum immediately, even though n-2 cluster
members are technically able to send and receive heartbeat messages through
both rings?

I really hope the answer is no and the cluster still somehow has quorum in
this case.

Regards,
Martin Schlegel

> Jan Friesse wrote on 5 October 2016 at 09:01:
>
> Martin,
>
> > Hello all,
> >
> > I am trying to understand the following 2 Corosync heartbeat ring
> > failure scenarios I have been testing, and hope somebody can explain
> > why the outcome makes sense.
> >
> > Consider the following cluster:
> >
> > * 3x nodes: A, B and C
> > * 2x NICs for each node
> > * Corosync 2.3.5 configured with "rrp_mode: passive" and udpu
> >   transport, with ring id 0 and 1 on each node.
> > * On each node "corosync-cfgtool -s" shows:
> >   [...] ring 0 active with no faults
> >   [...] ring 1 active with no faults
> >
> > Consider the following scenarios:
> >
> > 1. On node A only, block all communication on the first NIC,
> >    configured with ring id 0.
> > 2. On node A only, block all communication on all NICs, configured
> >    with ring id 0 and 1.
> >
> > The result of the above scenarios is as follows:
> >
> > 1. Nodes A, B and C (!) display the following ring status:
> >    [...] Marking ringid 0 interface FAULTY
> >    [...] ring 1 active with no faults
> > 2. Node A is shown as OFFLINE; B and C display the following ring
> >    status:
> >    [...] ring 0 active with no faults
> >    [...] ring 1 active with no faults
> >
> > Questions:
> > 1. Is this the expected outcome?
>
> Yes
>
> > 2. In experiment 1, B and C can still communicate with each other
> >    over both NICs, so why are B and C not displaying a "no faults"
> >    status for ring id 0 and 1 just like in experiment 2,
>
> Because this is how RRP works. RRP marks the whole ring as failed, so
> every node sees that ring as failed.
>
> > when node A is completely unreachable?
>
> Because it's a different scenario. In scenario 1 there is a 3-node
> membership where one node has one failed ring -> the whole ring is
> failed. In scenario 2 there is a 2-node membership where both rings
> work as expected. Node A is completely unreachable and it's not in the
> membership.
>
> Regards,
> Honza
>
> > Regards,
> > Martin Schlegel
[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)
Hello all,

I am trying to understand the following 2 Corosync heartbeat ring failure
scenarios I have been testing, and hope somebody can explain why the outcome
makes sense.

Consider the following cluster:

* 3x nodes: A, B and C
* 2x NICs for each node
* Corosync 2.3.5 configured with "rrp_mode: passive" and udpu transport,
  with ring id 0 and 1 on each node.
* On each node "corosync-cfgtool -s" shows:
  [...] ring 0 active with no faults
  [...] ring 1 active with no faults

Consider the following scenarios:

1. On node A only, block all communication on the first NIC, configured
   with ring id 0.
2. On node A only, block all communication on all NICs, configured with
   ring id 0 and 1.

The result of the above scenarios is as follows:

1. Nodes A, B and C (!) display the following ring status:
   [...] Marking ringid 0 interface FAULTY
   [...] ring 1 active with no faults
2. Node A is shown as OFFLINE; B and C display the following ring status:
   [...] ring 0 active with no faults
   [...] ring 1 active with no faults

Questions:
1. Is this the expected outcome?
2. In experiment 1, B and C can still communicate with each other over both
   NICs, so why are B and C not displaying a "no faults" status for ring id
   0 and 1 just like in experiment 2, when node A is completely unreachable?

Regards,
Martin Schlegel
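The post does not say how the blocking in the scenarios was done. One way to reproduce scenario 1 is to drop all traffic on the ring-0 interface of node A with firewall rules along these lines ("eth0" is a placeholder for whatever NIC carries ring 0):

```
# Run on node A only. Drop everything in and out of the ring-0 NIC;
# repeat for the second NIC to reproduce scenario 2.
iptables -A INPUT  -i eth0 -j DROP
iptables -A OUTPUT -o eth0 -j DROP
```

After this, "corosync-cfgtool -s" on each node shows the per-ring status quoted above; "iptables -D" with the same arguments removes the rules again.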