Hi Dan,

Dan Frincu wrote:
> Hi,
> 
> Ryan Steele wrote: 
>
>> Steven Dake wrote: 
>>>
>>> That is how it is supposed to work.  Any interface that is faulty within
>>> one ring will mark the entire ring faulty.  To reenable the ring, run
>>> corosync-cfgtool -r (once the faulty network condition has been repaired).
>>>
>>
>> Even if every other interface on that ring is working fine?  Why isn't the
>> node with the faulty interface segregated, so the rest can continue to
>> converse on that otherwise healthy ring?  That would dramatically increase
>> the resiliency of the rings, and it is much easier to scale with nodes than
>> it is with interfaces, especially with density being a big trend in
>> datacenters.  I can fit more twin-nodes in my racks than I can interfaces
>> on half a chassis.
>> 
>   
> Corosync uses the Totem Single-Ring Ordering and Membership Protocol [1]
> as the base for how it manages node membership; it performs the same
> token-passing logic found in token ring networks, but over Ethernet as the
> network infrastructure.  In a token ring network, what happened when one of
> the links in the ring was broken?  The entire ring was down.  Therefore, to
> ensure availability, token ring networks were designed with two redundant
> rings.  The same is true for this architecture.
> 
> HTH,
> Dan
> 
> 1.
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.767&rep=rep1&type=pdf

Thank you very much for the link; it's an interesting read, and I see the
limitations.  I also have several suggestions as to how they could be
overcome.


1. Use a mesh network.  Mesh networks aren't just for wireless anymore, and
they're about as fault tolerant as you can get.  There are a few open source
implementations out there.
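
To make the fault-tolerance claim a bit more concrete, here is a toy
comparison (my own sketch, not corosync code; the six-node topologies and the
brute-force cut search are assumptions purely for illustration) of how many
simultaneous link failures it takes to partition a ring versus a full mesh:

from itertools import combinations

def connected(nodes, links):
    """BFS reachability check over an undirected link set."""
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        cur = stack.pop()
        for a, b in links:
            if a == cur and b not in seen:
                seen.add(b)
                stack.append(b)
            elif b == cur and a not in seen:
                seen.add(a)
                stack.append(a)
    return len(seen) == len(nodes)

def min_link_failures_to_partition(nodes, links):
    """Smallest number of simultaneous link failures that splits the group
    (brute force -- fine for toy sizes)."""
    for k in range(1, len(links) + 1):
        for dead in combinations(links, k):
            if not connected(nodes, [l for l in links if l not in dead]):
                return k
    return len(links)

nodes = list(range(6))
ring = [(i, (i + 1) % len(nodes)) for i in nodes]
mesh = list(combinations(nodes, 2))
print("ring:", min_link_failures_to_partition(nodes, ring))  # 2
print("mesh:", min_link_failures_to_partition(nodes, mesh))  # 5

With six nodes, two link failures can already split the ring, while the full
mesh only partitions after five.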

2. Leverage a C-ring topology when one of the ring members loses one or more
interfaces, similar to FDDI/CDDI.  This way, if a node loses its physical
link on all ring interfaces, the two nodes on either side of it become
endpoints at which the data makes a U-turn.
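
Roughly what I have in mind (a toy sketch of the wrap behaviour only; the
node names and the flat-list representation are made up for the example, and
this is not how corosync models a ring):

def wrapped_rotation(ring_order, failed):
    """One full token rotation around a dual ring that has wrapped around a
    failed member: out along the primary to one endpoint, U-turn onto the
    secondary, and back to the starting endpoint."""
    survivors = [n for n in ring_order if n != failed]
    i = ring_order.index(failed)
    # Start at the failed node's successor so the break sits at the open end
    # of the resulting C shape.
    start = i % len(survivors)
    c_ring = survivors[start:] + survivors[:start]
    return c_ring + c_ring[-2::-1]

print(wrapped_rotation(["A", "B", "C", "D", "E"], "C"))
# ['D', 'E', 'A', 'B', 'A', 'E', 'D'] -- B and D, C's old neighbours, are the
# U-turn endpoints; every other survivor is visited once in each direction.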

3. Force a cluster reconfiguration if the ring is broken, sending out joins
to all previous members and reconstructing the rings with the healthy
members.  I think doing that reconstruction automatically is part of the
roadmap, if I understand that document correctly.
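
In pseudo-Python, something along these lines (send_join() and its behaviour
are invented for the example; the real Totem membership protocol is far more
involved than this):

def rebuild_ring(previous_members, send_join):
    """Ask every previous member to rejoin and return a new ring made up of
    only the members that answered (sorted for a deterministic ring order)."""
    responders = [node for node in previous_members if send_join(node)]
    return sorted(responders)

# Pretend nodes 2 and 5 have lost all of their interfaces and stay silent.
old_ring = [1, 2, 3, 4, 5, 6]
unreachable = {2, 5}
print(rebuild_ring(old_ring, lambda node: node not in unreachable))  # [1, 3, 4, 6]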

I confess I'm no clustering expert, so I may be barking up the wrong tree
(feel free to cluebat if necessary), but two interface failures shouldn't
bring down an entire two-ring N-node cluster (a toy illustration of that
scenario is at the end of this mail).

There are probably other possibilities as well... I know that Spread
advertises tolerance of N-1 failures, but I think they might use a few
different network topologies (ring, hop, and something else), and I'm not
quite sure how it's implemented, so I'm not comfortable saying that it's a
viable (or even recommended) option, in parts or as a whole.  I'm just
trying to understand the limitations and offer some suggestions as to how
they might be addressed.  Thanks for all the hard work on the project, and
for your discussion on this topic.
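
Here's the toy illustration I mentioned above (my own model of the "any
faulty interface marks the whole ring faulty" rule as I understand it from
this thread; the data structures are invented and this is not corosync code):

N_NODES, N_RINGS = 8, 2

# interface_ok[node][ring] -- True if that node's interface on that ring is up.
interface_ok = [[True] * N_RINGS for _ in range(N_NODES)]
interface_ok[0][0] = False   # node 0 loses its ring-0 interface
interface_ok[5][1] = False   # node 5 loses its ring-1 interface

ring_usable = [
    all(interface_ok[node][ring] for node in range(N_NODES))
    for ring in range(N_RINGS)
]
print("usable rings:", ring_usable.count(True))                 # 0
print("nodes with at least one working interface:",
      sum(any(ifaces) for ifaces in interface_ok))              # 8

Two unlucky interface failures on two different rings leave zero usable
rings, even though every node still has at least one healthy interface.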

-Ryan
