*From: *Jan Friesse <[email protected]>
*Sent: * 2014-06-20 04:13:42 EDT
*To: *Patrick Hemmer <[email protected]>, [email protected]
*Subject: *Re: [corosync] automatic membership discovery

> Patrick,
>
> Now let's say you really cannot use multicast (which is sadly highly
> probable in a cloud environment).
>
> The first thing I totally didn't get is how the whole thing can work
> (reliably) without a persistent node list.
There would be a node list, but let's say it's expensive to obtain, such
as having to make a remote call to an external service. So it can be
done, but it can't be used to detect nodes joining and leaving
(polling-based, not push).
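To make that concrete: the idea is that some small helper outside of
corosync would poll that external service and push whatever it finds
into the runtime nodelist. A rough sketch of the push side using the
cmap C API might look like the following; the index, nodeid and address
are completely made up, and error handling is minimal:

    #include <stdio.h>
    #include <stdint.h>
    #include <corosync/cmap.h>

    /* Sketch only: add one externally discovered node to the runtime
     * nodelist. The values passed in from main() are made up. */
    static int add_discovered_node(cmap_handle_t cmap, uint32_t index,
                                   uint32_t nodeid, const char *addr)
    {
        char key[256];

        snprintf(key, sizeof(key), "nodelist.node.%u.nodeid", index);
        if (cmap_set_uint32(cmap, key, nodeid) != CS_OK)
            return -1;

        snprintf(key, sizeof(key), "nodelist.node.%u.ring0_addr", index);
        if (cmap_set_string(cmap, key, addr) != CS_OK)
            return -1;

        return 0;
    }

    int main(void)
    {
        cmap_handle_t cmap;

        if (cmap_initialize(&cmap) != CS_OK)
            return 1;

        add_discovered_node(cmap, 2, 3, "10.0.0.3");  /* hypothetical node */

        cmap_finalize(cmap);
        return 0;
    }

The point being that the mechanism already exists; the open question is
who drives it and when.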

>
>> *From: *Jan Friesse <[email protected]>
>> *Sent: * 2014-06-19 09:50:17 EDT
>> *To: *Patrick Hemmer <[email protected]>, [email protected]
>> *Subject: *Re: [corosync] automatic membership discovery
>>
>>> Patrick,
>>> So just to recapitulate your idea: let's say you have a cluster with 2
>>> nodes. Now you decide to add a third node. Your idea is to properly
>>> configure the 3rd node (so that if we distributed that config file and
>>> called reload on every node, everything would work); in other words,
>>> add the 3rd node ONLY to the config file on the 3rd node and then start
>>> corosync. The other nodes will just accept the node and add it to their
>>> membership (and probably some kind of automatically generated
>>> persistent list of nodes). Do I understand it correctly?
>>
>> I hadn't considered a persistent storage of nodes as a requirement. But
>
> Ok. Now I'm totally lost. I cannot imagine how this can work WITHOUT
> persistent storage of nodes. I mean, let's say you have 5 nodes. If I
> understood it correctly, their config files will probably look like:
> 1st node - 1 node (only itself)
> 2nd node - 2 nodes (node 1 and node 2)
> 3rd node - 3 nodes (node 1, node 2, node 3)
> ...
>
> Then everything is ok. But what if the user decides to have configs
> with the following content:
> 1st node - 1 node (only itself)
> 2nd node - 2 nodes (node 1 and node 2)
> 3rd node - 2 nodes (node 2 and node 3)
> 4th node - 2 nodes (node 3 and node 4)
> 5th node - 2 nodes (node 4 and node 5)
>
> Such a config is perfectly valid, and when starting the nodes in the
> correct order you have a 5-node cluster, right? Now let's say the
> cluster is stopped. You then add an iptables blocking rule so that
> node 1 sees node 2 (but no other nodes), node 2 sees node 1 (but no
> other nodes), node 3 sees node 4 (but no other nodes) and node 4 sees
> node 3 (but no other nodes).
>
> You start nodes 1-4 and ... you have TWO perfectly quorate
> clusters, right?
Sorry, I think I see where the confusion is coming from. We can set a
rule such that when corosync starts, it either starts with only itself
in the node list (either waiting to be contacted, or for some external
thing to populate it), or it starts with a full nodelist.
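For illustration, the "only itself" starting point could be a config as
small as this (cluster name and address are made up), with everything
else learned at runtime:

    totem {
        version: 2
        cluster_name: example        # made-up name
        transport: udpu
    }

    nodelist {
        node {
            ring0_addr: 10.0.0.3     # this node's own address (made up)
            nodeid: 3
        }
    }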
>
>
>> if you wanted to persist the discovered nodes, you could have something
>> (whether corosync, or an external tool) watch the cmap nodelist, and
>> write out to a file when the nodelist changes.
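(For what it's worth, the watching part is easy to prototype against the
cmap tracking API today; a rough sketch, with the actual file writing
left out, might be:)

    #include <stdio.h>
    #include <corosync/cmap.h>

    /* Sketch: get a callback whenever anything under "nodelist." changes.
     * A real tool would rewrite its nodelist file here instead of
     * printing. */
    static void nodelist_changed(cmap_handle_t cmap, cmap_track_handle_t track,
                                 int32_t event, const char *key,
                                 struct cmap_notify_value new_val,
                                 struct cmap_notify_value old_val,
                                 void *user_data)
    {
        printf("nodelist key changed: %s\n", key);
    }

    int main(void)
    {
        cmap_handle_t cmap;
        cmap_track_handle_t track;

        if (cmap_initialize(&cmap) != CS_OK)
            return 1;

        cmap_track_add(cmap, "nodelist.",
                       CMAP_TRACK_ADD | CMAP_TRACK_DELETE |
                       CMAP_TRACK_MODIFY | CMAP_TRACK_PREFIX,
                       nodelist_changed, NULL, &track);

        cmap_dispatch(cmap, CS_DISPATCH_BLOCKING);

        cmap_finalize(cmap);
        return 0;
    }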
>
> Honestly, I would rather talk about the use case/higher level
> rather than a concrete implementation. I mean, a concrete
> implementation makes us focus on one way, but there may be other ways.
> So something like storing/not storing nodes, ...
>
>> I didn't consider it a requirement as I considered the possible
>> scenarios that would result in a split brain to be near impossible. For
>> example, if you just have a config file where the only node is itself,
>> when it comes up, it could be made such that it doesn't get consensus
>> until it can contact another node. When it does, that other node would
>> share the quorum info, and perhaps even the nodelist. In the event that
>> any number of nodes fail, the last_man_standing behavior will keep the
>> cluster from going split brain (a node will only be removed from the
>> nodelist if it leaves gracefully, or the cluster maintains quorum for the
>> duration of the LMS window).
>> Basically the only scenario I can think of that could result in a split
>> brain is if 2 nodes shut down without a persistent nodelist, and then
>> started back up and were somehow told about each other, but not the rest
>> of the cluster.
>>
>> In fact persistent storage might even be a problem. If a node goes down,
>> and while it's down another node leaves the cluster, when it comes back
>> up, it won't know that node is gone. Though you could solve this by
>> obtaining the nodelist from the rest of the cluster (if the rest of the
>> cluster is quorate).
>>
>> Basically:
>> * A node cannot be quorate by itself.
>
> This is a bad requirement. A one-node cluster is weird (but still used),
> but a 2-node cluster (which is actually probably the most common
> scenario) where one of the nodes is in maintenance mode is perfectly
> ok. Such a cluster HAS to be quorate.
This wouldn't apply to all usages of corosync. This would be an
operational mode corosync can be in, like the existing two_node mode is.
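To illustrate what I mean by an operational mode, in config terms I'd
picture it sitting next to the existing special-case flags, something
like this (two_node is real; the auto_join name is purely hypothetical
and does not exist today):

    quorum {
        provider: corosync_votequorum
        two_node: 1        # existing special-case mode
        # auto_join: 1     # hypothetical flag for the mode discussed here
    }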
>
>> * Corosync will add any node that contacts it to its own node list.
>> * Upon join, the side of the cluster that is quorate will send its
>> quorum information (expected votes, current votes, etc) to the inquorate
>> side.
>
> This is already happening.
Except for things like downscaling.

>
>> * (uncertain) If quorate, corosync may share the nodelist with the rest
>> of the cluster (the new node learns existing nodes & existing nodes
>> learn the new node without it contacting them).
>> * If a node leaves gracefully, it will be removed from the nodelist.
>> * If a node leaves ungracefully, it will be removed if the cluster
>> remains quorate for the duration of last man standing window.
>>
>>
>>>
>>> Because if so, I believe it would also mean changing the config file,
>>> simply to keep them in sync. And honestly, keeping the config files in
>>> sync is for sure a way I would like to go, but that way is very hard.
>>> Every single thing must be very well defined (like what is
>>> synchronized and what is not).
>> Yes, I wouldn't consider removing the config file. Though one
>> possibility might be keeping the node list separate from the config
>> file, and letting corosync update that.
>>
>> As simple as the idea is, it may indeed be that this isn't the direction
>> corosync should go. Traditionally corosync has been geared more towards
>> static clusters that don't change often. But with cloud computing
>
> corosync is not designed for static clusters. Actually it's quite the
> opposite. UDPU is the newer mode. The original UDP mode (which still
> exists, is still supported and is still the default) handles dynamic
> clusters very well (as long as the HW is able to do multicast).
I don't see how you can argue this. The allow_downscale feature is very
new, and is still not supported
(https://github.com/corosync/corosync/blob/master/man/votequorum.5#L301).
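For reference, the behavior I keep referring to is the votequorum
configuration, roughly like this (these options are real; the window
value is just the documented default):

    quorum {
        provider: corosync_votequorum
        last_man_standing: 1
        last_man_standing_window: 10000   # ms
        allow_downscale: 1                # marked unsupported in votequorum(5)
    }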
>
>> becoming so prevalent, the need for dynamic clusters is growing very
>> rapidly. There are several other projects which are implementing this
>> functionality, such as etcd
>> (https://github.com/coreos/etcd/blob/master/Documentation/design/cluster-finding.md)
>>
>> and consul (http://www.consul.io/intro/getting-started/join.html). But
>> these other services tend to be key/value stores, utilize a very heavy
>> protocol (such as http), and don't offer a CPG type service.
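(By "a CPG type service" I mean corosync's closed process group
messaging: join a named group and get totally ordered messages and
membership callbacks. A bare-bones sketch against the CPG C API, with
the callbacks stubbed out and a made-up group name, is roughly:)

    #include <string.h>
    #include <sys/uio.h>
    #include <corosync/cpg.h>

    /* Stub callbacks; a real client would handle messages and
     * membership changes here. */
    static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                           uint32_t nodeid, uint32_t pid,
                           void *msg, size_t msg_len)
    {
    }

    static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                           const struct cpg_address *members, size_t n_members,
                           const struct cpg_address *left, size_t n_left,
                           const struct cpg_address *joined, size_t n_joined)
    {
    }

    int main(void)
    {
        cpg_callbacks_t callbacks = {
            .cpg_deliver_fn = deliver_cb,
            .cpg_confchg_fn = confchg_cb,
        };
        cpg_handle_t handle;
        struct cpg_name group;
        struct iovec iov;

        if (cpg_initialize(&handle, &callbacks) != CS_OK)
            return 1;

        strcpy(group.value, "example_group");   /* made-up group name */
        group.length = strlen(group.value);
        cpg_join(handle, &group);

        iov.iov_base = "hello";
        iov.iov_len = 5;
        cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

        cpg_finalize(handle);
        return 0;
    }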
>>
>
> First, keep in mind that all Raft-based protocols (so both etcd and
> consul) need quorum. Corosync itself doesn't need it (+ pacemaker can
> also work in "without quorum" mode). In other words, when Raft loses
> quorum, the whole cluster is dead and manual intervention is needed.
> This is in STRICT opposition to the last-man-standing behavior. I see
> this as a bigger blocker than tcp/http/cpg (actually, cpg is pretty
> easily implementable using a key/value store).
>
> So the question is, why can't you use multicast?
You nailed it earlier: cloud networks don't allow multicast or
broadcast. I've even worked for companies whose network admins don't
allow it either.
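Concretely, the difference is between being allowed to run the default
multicast transport and being forced onto udpu with an explicit peer
list. Both fragments below use real totem options; the addresses are
made up:

    # default: multicast, fully dynamic membership
    totem {
        version: 2
        transport: udp
        interface {
            ringnumber: 0
            bindnetaddr: 10.0.0.0      # made-up network
            mcastaddr: 239.255.1.1
            mcastport: 5405
        }
    }

    # what cloud networks leave us with: unicast plus an explicit nodelist
    totem {
        version: 2
        transport: udpu
    }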
>
> The other question is, did you try multicast? If so, is the multicast
> behavior something you would like to achieve with UDPU?
Mostly. It seems like it's on the right track, but downscaling is
problematic.

>
> Regards,
>   Honza
>
>>> Regards,
>>>    Honza
>>>
>>> Patrick Hemmer wrote:
>>>> From: Patrick Hemmer <[email protected]>
>>>> Sent: 2014-06-16 11:25:40 EDT
>>>> To: Jan Friesse <[email protected]>, [email protected]
>>>> Subject: Re: [corosync] automatic membership discovery
>>>>
>>>>
>>>> On 2014/06/16 11:25, Patrick Hemmer wrote:
>>>>> Patrick,
>>>>>
>>>>>> I'm interested in having corosync automatically accept members into
>>>>>> the cluster without manual reconfiguration. Meaning that when I
>>>>>> bring a new node online, I want to configure it for the existing
>>>>>> nodes, and those nodes will automatically add the new node into
>>>>>> their nodelist.
>>>>>> From a purely technical standpoint, this doesn't seem like it would
>>>>>> be hard to do. The only 2 things you have to do to add a node are
>>>>>> add the nodelist.node.X.nodeid and ring0_addr to cmap. When the new
>>>>>> node comes up, it starts sending out messages to the existing nodes.
>>>>>> The ring0_addr can be discovered from the source address, and the
>>>>>> nodeid is in the message.
>>>>>>
>>>>> I need to think about this a little deeper. It sounds like it may
>>>>> work, but I'm not entirely sure.
>>>>>
>>>>>> Going even further, when using the allow_downscale and
>>>>>> last_man_standing features, we can automatically remove nodes from
>>>>>> the cluster when they disappear. With last_man_standing, the quorum
>>>>>> expected votes is automatically adjusted when a node is lost, so it
>>>>>> makes no difference whether the node is offline, or removed. Then
>>>>>> with the auto-join functionality, it'll automatically be added back
>>>>>> in when it re-establishes communication.
>>>>>>
>>>>>> It might then even be possible to write the cmap data out to a file
>>>>>> when a node joins or leaves. This way, if corosync restarts and the
>>>>>> corosync.conf hasn't been updated, the nodelist can be read from
>>>>>> this save. If the save is out of date and some nodes are
>>>>>> unreachable, they would simply be removed, and added back when they
>>>>>> join.
>>>>>> This wouldn't even have to be a part of corosync. Some external
>>>>>> utility could watch the cmap values and take care of setting them
>>>>>> when corosync is launched.
>>>>>>
>>>>>> Ultimately this allows us to have a large-scale, dynamically sized
>>>>>> cluster without having to edit the config of every node each time a
>>>>>> node joins or leaves.
>>>>>>
>>>>> Actually, this is exactly what pcs does.
>>>> Unfortunately pcs has lots of issues.
>>>>
>>>>   1. It assumes you will be using pacemaker as well.
>>>>      In some of our uses, we are using corosync without pacemaker.
>>>>
>>>>   2. It still has *lots* of bugs. Even more once you start trying to
>>>>      use non-fedora based distros.
>>>>      Some bugs have been open on the project for a year and a half.
>>>>
>>>>   3. It doesn't know the real address of its own host.
>>>>      What I mean is when a node is sitting behind NAT. We plan on
>>>>      running corosync inside a docker container, and the container
>>>>      goes through NAT if it needs to talk to another host. So pcs
>>>>      would need to know the NAT address to advertise it to the other
>>>>      hosts. With the method described here, that address is
>>>>      automatically discovered.
>>>>
>>>>   4. Doesn't handle automatic cleanup.
>>>>      If you remove a node, something has to go and clean that node up.
>>>>      Basically you would have to write a program to connect to the
>>>>      quorum service and monitor for nodes going down, and then remove
>>>>      them. But then what happens if that node was only temporarily
>>>>      down? Who is responsible for adding it back into the cluster? If
>>>>      the node that was down is responsible for adding itself back in,
>>>>      what if another node joined the cluster while it was down? Its
>>>>      list will be incomplete. You could do a few things to try and
>>>>      alleviate these headaches, but automatic membership just feels
>>>>      more like the right solution.
>>>>
>>>>   5. It doesn't allow you to adjust the config file.
>>>>
>>>>
>>>>
>>>>
>>>>>> This really doesn't sound like it would be hard to do. I might even
>>>>>> be willing to attempt implementing it myself if this sounds like
>>>>>> something that would be acceptable to merge into the code base.
>>>>>> Thoughts?
>>>>>>
>>>>> Yes, but the question is whether it is really worth it. I mean:
>>>>> - With multicast you have FULLY dynamic membership
>>>>> - PCS is able to distribute the config file, so adding a new node to
>>>>>   a UDPU cluster is easy
>>>>>
>>>>> Do you see any use case where pcs or multicast doesn't work? (To
>>>>> clarify: I'm not blaming your idea (actually I find it interesting),
>>>>> but I'm trying to find a real killer use case for this feature, whose
>>>>> implementation will almost surely take quite a lot of time.)
>>>> Aside from the pcs issues mentioned above, having this in corosync
>>>> just feels like the right solution. No external processes involved,
>>>> no additional lines of communication, real-time on-demand updating.
>>>> The end goal might be accomplished by modifying pcs to resolve the
>>>> issues, but is that the right way? If people want to use crmsh over
>>>> pcs, do they not get this functionality?
>>>>
>>>>> Regards,
>>>>>    Honza
>>>>>
>>>>>> -Patrick
>>>>>>
>>>>>>
>>>>>>
>>>>
>>
>>
>

_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss
