Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-07 Thread Martin Schlegel
Thanks for all responses from Jan, Ulrich and Digimer !

We are already using bonded network interfaces, but we are also forced to go
across IP subnets, and certain routes between routers can go missing - and
indeed have done so.
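
For reference, an active-backup bond can be brought up with iproute2 roughly
as follows (interface names and the address are placeholders; in practice
this lives in the distribution's own network configuration):

  # create an active-backup bond that checks link state every 100 ms
  ip link add bond0 type bond mode active-backup miimon 100
  # enslave the two physical NICs (they must be down while being enslaved)
  ip link set eth0 down
  ip link set eth0 master bond0
  ip link set eth1 down
  ip link set eth1 master bond0
  # bring the bond up and give it the address corosync binds to
  ip link set bond0 up
  ip addr add 10.0.1.11/24 dev bond0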

This has already happened on one of our nodes' public network: it became
unreachable from other local public IP subnets. If this were to happen in
parallel on another node's private network, the entire cluster would be down
just because - as Ulrich said, "It's a ring !" - both heartbeat rings get
marked faulty. That is not an optimal result, because cluster communication
is in fact still 100% possible between all nodes.

With an increasing number of nodes this risk becomes significant - just think
of providers of larger cloud infrastructures.

With the above scenario in mind - is there a better (tested and recommended) way
to configure this ?
... or is knet the way to go in the future then ?
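
For comparison, my understanding is that a corosync 3.x / knet setup would
describe two links per node instead of two rrp rings, and would track link
health per node pair rather than marking a whole ring faulty. A rough sketch
(syntax may still change before release; names and addresses are placeholders):

  totem {
      version: 2
      transport: knet
  }

  nodelist {
      # ring0_addr = link 0 (e.g. private subnet), ring1_addr = link 1 (public)
      node {
          nodeid: 1
          name: node-a
          ring0_addr: 10.0.1.11
          ring1_addr: 192.168.1.11
      }
      node {
          nodeid: 2
          name: node-b
          ring0_addr: 10.0.1.12
          ring1_addr: 192.168.1.12
      }
      node {
          nodeid: 3
          name: node-c
          ring0_addr: 10.0.1.13
          ring1_addr: 192.168.1.13
      }
  }

  quorum {
      provider: corosync_votequorum
  }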


Regards,
Martin Schlegel


> Jan Friesse wrote on 7 October 2016 at 11:28:
> 
> Martin Schlegel wrote:
> 
> > Thanks for the confirmation Jan, but this sounds a bit scary to me !
> > 
> > Spinning this experiment a bit further ...
> > 
> > Would this not also mean that with a passive rrp with 2 rings it only
> > takes 2 different nodes that are unable to communicate on different
> > networks at the same time to have all rings marked faulty on _every_ node
> > ... therefore all cluster members losing quorum immediately, even though
> > n-2 cluster members are technically able to send and receive heartbeat
> > messages through both rings ?
> 
> Not exactly, but this situation causes corosync to start behaving really
> badly, spending most of its time in a "creating new membership" loop.
> 
> Yes, RRP is simply bad. If you can, use bonding. Replacing RRP with knet
> is the biggest TODO for 3.x.
> 
> Regards,
>  Honza
> 
> > I really hope the answer is no and the cluster still somehow has a quorum in
> > this case.
> > 
> > Regards,
> > Martin Schlegel
> 
> >> Jan Friesse wrote on 5 October 2016 at 09:01:
> >>
> >> Martin,
> >>
> >>> Hello all,
> >>>
> >>> I am trying to understand the following 2 Corosync heartbeat ring
> >>> failure scenarios I have been testing, and I hope somebody can explain
> >>> why this behaviour makes sense.
> >>>
> >>> Consider the following cluster:
> >>>
> >>> * 3x Nodes: A, B and C
> >>> * 2x NICs for each Node
> >>> * Corosync 2.3.5 configured with "rrp_mode: passive" and
> >>> udpu transport with ring id 0 and 1 on each node.
> >>> * On each node "corosync-cfgtool -s" shows:
> >>> [...] ring 0 active with no faults
> >>> [...] ring 1 active with no faults
> >>>
> >>> Consider the following scenarios:
> >>>
> >>> 1. On node A only block all communication on the first NIC configured with
> >>> ring id 0
> >>> 2. On node A only block all communication on all NICs configured with
> >>> ring id 0 and 1
> >>>
> >>> The result of the above scenarios is as follows:
> >>>
> >>> 1. Nodes A, B and C (!) display the following ring status:
> >>> [...] Marking ringid 0 interface  FAULTY
> >>> [...] ring 1 active with no faults
> >>> 2. Node A is shown as OFFLINE - B and C display the following ring status:
> >>> [...] ring 0 active with no faults
> >>> [...] ring 1 active with no faults
> >>>
> >>> Questions:
> >>> 1. Is this the expected outcome ?
> >>
> >> Yes
> >>
> >>> 2. In experiment 1, B and C can still communicate with each other over
> >>> both NICs, so why are B and C not displaying a "no faults" status for
> >>> ring id 0 and 1, just like in experiment 2,
> >>
> >> Because this is how RRP works. RRP marks the whole ring as failed, so
> >> every node sees that ring as failed.
> >>
> >>> when node A is completely unreachable ?
> >>
> >> Because it's a different scenario. In scenario 1 there is a 3-node
> >> membership where one node has one failed ring -> the whole ring is
> >> marked failed. In scenario 2 there is a 2-node membership where both
> >> rings work as expected. Node A is completely unreachable and it's not
> >> in the membership.
> >>
> >> Regards,
> >> Honza
> >>
> >>> Regards,
> >>> Martin Schlegel
> >>>

Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-07 Thread Jan Friesse

Martin Schlegel wrote:

Thanks for the confirmation Jan, but this sounds a bit scary to me !

Spinning this experiment a bit further ...

Would this not also mean that with a passive rrp with 2 rings it only takes 2
different nodes that are unable to communicate on different networks at the
same time to have all rings marked faulty on _every_ node ... therefore all
cluster members losing quorum immediately, even though n-2 cluster members are
technically able to send and receive heartbeat messages through both rings ?


Not exactly, but this situation causes corosync to start behaving really
badly, spending most of its time in a "creating new membership" loop.


Yes, RRP is simply bad. If you can, use bonding. Replacing RRP with knet is
the biggest TODO for 3.x.


Regards,
  Honza



I really hope the answer is no and the cluster still somehow has a quorum in
this case.

Regards,
Martin Schlegel



Jan Friesse wrote on 5 October 2016 at 09:01:

Martin,


Hello all,

I am trying to understand the following 2 Corosync heartbeat ring failure
scenarios I have been testing, and I hope somebody can explain why this
behaviour makes sense.

Consider the following cluster:

  * 3x Nodes: A, B and C
  * 2x NICs for each Node
  * Corosync 2.3.5 configured with "rrp_mode: passive" and
  udpu transport with ring id 0 and 1 on each node.
  * On each node "corosync-cfgtool -s" shows:
  [...] ring 0 active with no faults
  [...] ring 1 active with no faults

Consider the following scenarios:

  1. On node A only block all communication on the first NIC configured with
ring id 0
  2. On node A only block all communication on all NICs configured with
ring id 0 and 1

The result of the above scenarios is as follows:

  1. Nodes A, B and C (!) display the following ring status:
  [...] Marking ringid 0 interface  FAULTY
  [...] ring 1 active with no faults
  2. Node A is shown as OFFLINE - B and C display the following ring status:
  [...] ring 0 active with no faults
  [...] ring 1 active with no faults

Questions:
  1. Is this the expected outcome ?


Yes


2. In experiment 1, B and C can still communicate with each other over both
NICs, so why are B and C not displaying a "no faults" status for ring id 0
and 1, just like in experiment 2,


Because this is how RRP works. RRP marks the whole ring as failed, so every
node sees that ring as failed.


when node A is completely unreachable ?


Because it's a different scenario. In scenario 1 there is a 3-node membership
where one node has one failed ring -> the whole ring is marked failed. In
scenario 2 there is a 2-node membership where both rings work as expected.
Node A is completely unreachable and it's not in the membership.
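
For completeness: once the underlying network problem is repaired, the fault
flag can be cleared cluster-wide with corosync-cfgtool (depending on the rrp
autorecovery settings it may also clear on its own):

  # show the fault status of both rings on this node
  corosync-cfgtool -s
  # re-enable redundant ring operation cluster-wide after the fault is fixed
  corosync-cfgtool -r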

Regards,
  Honza


Regards,
Martin Schlegel



Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-06 Thread Dimitri Maziuk
PS. In security, handling everything at one (high) level is known as a
"hard crunchy shell with a soft chewy center". It's not seen as a good thing.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu







Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-06 Thread Dimitri Maziuk
On 10/06/2016 11:25 AM, Klaus Wenninger wrote:

> But it is convenient because all layers on top can be completely
> agnostic of the duplication.

It's also cheap: failing over a node, especially when taking over involves
replaying a database log or even just re-establishing a bunch of NFS
connections, is far more disruptive than mdadm going into a "degraded" state
and sending you an e-mail.
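
(For reference, that e-mail comes from mdadm's monitor mode; a minimal setup,
with the mail address as a placeholder:)

  # /etc/mdadm.conf (or /etc/mdadm/mdadm.conf on Debian-based systems)
  MAILADDR root@example.org

  # run the monitor as a daemon; it mails on DegradedArray, Fail, etc.
  mdadm --monitor --scan --daemonise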

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-06 Thread Dimitri Maziuk
On 10/06/2016 09:26 AM, Klaus Wenninger wrote:

> Usually one - at least I have so far - would rather think that having
> the awareness of redundancy/clustering as high up as possible in the
> protocol/application stack would open up possibilities for more
> appropriate reactions.

The obvious counter-example is hard disk failure: failures are common on
commodity spinning-rust drives, and they are cheap and easy to handle at a
lower level by throwing a second disk into an mdadm RAID-1.
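
(A minimal sketch of that lower-level fix, with device names as placeholders:)

  # mirror two disks; the array keeps working if either one dies
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
  # watch array state and resync progress
  cat /proc/mdstat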

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-06 Thread Martin Schlegel
Thanks for the confirmation Jan, but this sounds a bit scary to me !

Spinning this experiment a bit further ...

Would this not also mean that with a passive rrp with 2 rings it only takes 2
different nodes that are unable to communicate on different networks at the
same time to have all rings marked faulty on _every_ node ... therefore all
cluster members losing quorum immediately, even though n-2 cluster members are
technically able to send and receive heartbeat messages through both rings ?

I really hope the answer is no and the cluster still somehow has a quorum in
this case.
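
(While testing this, the membership and quorum view on each node can be
checked directly; both tools ship with corosync 2.x:)

  # expected votes, total votes and whether this partition is quorate
  corosync-quorumtool -s
  # per-ring fault status as seen by this node
  corosync-cfgtool -s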


Regards,
Martin Schlegel

> Jan Friesse wrote on 5 October 2016 at 09:01:
> 
> Martin,
> 
> > Hello all,
> > 
> > I am trying to understand the following 2 Corosync heartbeat ring failure
> > scenarios I have been testing, and I hope somebody can explain why this
> > behaviour makes sense.
> > 
> > Consider the following cluster:
> > 
> >  * 3x Nodes: A, B and C
> >  * 2x NICs for each Node
> >  * Corosync 2.3.5 configured with "rrp_mode: passive" and
> >  udpu transport with ring id 0 and 1 on each node.
> >  * On each node "corosync-cfgtool -s" shows:
> >  [...] ring 0 active with no faults
> >  [...] ring 1 active with no faults
> > 
> > Consider the following scenarios:
> > 
> >  1. On node A only block all communication on the first NIC configured with
> > ring id 0
> >  2. On node A only block all communication on all NICs configured with
> > ring id 0 and 1
> > 
> > The result of the above scenarios is as follows:
> > 
> >  1. Nodes A, B and C (!) display the following ring status:
> >  [...] Marking ringid 0 interface  FAULTY
> >  [...] ring 1 active with no faults
> >  2. Node A is shown as OFFLINE - B and C display the following ring status:
> >  [...] ring 0 active with no faults
> >  [...] ring 1 active with no faults
> > 
> > Questions:
> >  1. Is this the expected outcome ?
> 
> Yes
> 
> > 2. In experiment 1, B and C can still communicate with each other over
> > both NICs, so why are B and C not displaying a "no faults" status for
> > ring id 0 and 1, just like in experiment 2,
> 
> Because this is how RRP works. RRP marks the whole ring as failed, so
> every node sees that ring as failed.
> 
> > when node A is completely unreachable ?
> 
> Because it's a different scenario. In scenario 1 there is a 3-node
> membership where one node has one failed ring -> the whole ring is marked
> failed. In scenario 2 there is a 2-node membership where both rings work
> as expected. Node A is completely unreachable and it's not in the
> membership.
> 
> Regards,
>  Honza
> 
> > Regards,
> > Martin Schlegel
> > 


[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-04 Thread Martin Schlegel
Hello all,

I am trying to understand the following 2 Corosync heartbeat ring failure
scenarios I have been testing, and I hope somebody can explain why this
behaviour makes sense.


Consider the following cluster:

* 3x Nodes: A, B and C
* 2x NICs for each Node
* Corosync 2.3.5 configured with "rrp_mode: passive" and udpu transport
  with ring id 0 and 1 on each node (see the config sketch below).
* On each node "corosync-cfgtool -s" shows:
[...] ring 0 active with no faults
[...] ring 1 active with no faults
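
For completeness, a minimal corosync.conf along those lines would look roughly
like this (subnets, addresses and node ids are placeholders, not our real
values):

  totem {
      version: 2
      transport: udpu
      rrp_mode: passive
      # ring 0: first subnet
      interface {
          ringnumber: 0
          bindnetaddr: 10.0.1.0
          mcastport: 5405
      }
      # ring 1: second subnet
      interface {
          ringnumber: 1
          bindnetaddr: 192.168.1.0
          mcastport: 5405
      }
  }

  nodelist {
      node {
          nodeid: 1
          ring0_addr: 10.0.1.11
          ring1_addr: 192.168.1.11
      }
      node {
          nodeid: 2
          ring0_addr: 10.0.1.12
          ring1_addr: 192.168.1.12
      }
      node {
          nodeid: 3
          ring0_addr: 10.0.1.13
          ring1_addr: 192.168.1.13
      }
  }

  quorum {
      provider: corosync_votequorum
  }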


Consider the following scenarios:

1. On node A only, block all communication on the first NIC (the one
   configured with ring id 0); one way to do this is sketched below
2. On node A only, block all communication on all NICs (those configured
   with ring id 0 and 1)
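
One way to emulate scenario 1 during testing, assuming ring 0 runs over eth1
on node A (interface name and method are just an illustration; ifdown or a
switch ACL works as well):

  # on node A: drop all traffic in and out of the ring-0 interface
  iptables -A INPUT  -i eth1 -j DROP
  iptables -A OUTPUT -o eth1 -j DROP
  # undo after the test
  iptables -D INPUT  -i eth1 -j DROP
  iptables -D OUTPUT -o eth1 -j DROP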


The result of the above scenarios is as follows:

1. Nodes A, B and C (!) display the following ring status:
[...] Marking ringid 0 interface  FAULTY
[...] ring 1 active with no faults
2. Node A is shown as OFFLINE - B and C display the following ring status:
[...] ring 0 active with no faults
[...] ring 1 active with no faults


Questions:
1. Is this the expected outcome ?
2. In experiment 1, B and C can still communicate with each other over both
   NICs, so why are B and C not displaying a "no faults" status for ring id 0
   and 1, just like in experiment 2, when node A is completely unreachable ?


Regards,
Martin Schlegel

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org