Re: Enablement of controller clustering

Brendan McAdams Fri, 22 Sep 2017 11:00:59 -0700

Tyson makes a good point WRT his environment, which appears to be all
dynamic IPs via Mesos. I've run into this with more than a few Akka Cluster
deployments, and it is frustrating because you really need at least a few
stable seeds for your other nodes' startup list. I usually recommend
having 2 or 3 (ideally at least 3) static IPs, or at least static
DNS entries, that you can give as your seeds so your dynamic nodes come up
cleanly. Unfortunately, I don't recall the best way to accomplish this
under Mesos. *But*, if clustering is a feature OpenWhisk wants to expose, it
seems like something that is going to need testing under a few different
"common" environments.
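
To make the seed recommendation concrete, here is a minimal sketch that renders an `akka.cluster.seed-nodes` entry from a few stable DNS names. The actor system name, the host names, and the port are hypothetical placeholders, not values taken from OpenWhisk; the address form `akka.tcp://system@host:port` is the one used by classic Akka remoting over TCP.

```python
# Sketch only: build an Akka Cluster seed-node list from stable DNS names.
# The system name, hosts, and port used below are hypothetical placeholders.

def seed_node_addresses(system, hosts, port):
    """Return one akka.tcp://system@host:port address per stable host."""
    return [f"akka.tcp://{system}@{host}:{port}" for host in hosts]


def render_seed_config(system, hosts, port):
    """Render a HOCON fragment that sets akka.cluster.seed-nodes."""
    entries = ",\n".join(
        f'  "{address}"' for address in seed_node_addresses(system, hosts, port)
    )
    return "akka.cluster.seed-nodes = [\n" + entries + "\n]"


if __name__ == "__main__":
    # Three static DNS entries, as suggested above (names are made up).
    stable_hosts = [
        "controller-seed-0.example.internal",
        "controller-seed-1.example.internal",
        "controller-seed-2.example.internal",
    ]
    print(render_seed_config("controller-actor-system", stable_hosts, 2552))
```

Dynamic nodes would be given the same rendered list at startup, so they can join through whichever seed is reachable regardless of their own IP.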


I have to double-check the internals on the "IP change" edge case. I don't
believe it will make a difference at a baseline, but if there are specific
cases of concern (such as a particular component like the sharding
coordinator [which will auto-migrate if the node it is running on goes down
anyway]), I'd be happy to dive into them.

-b



On Fri, Sep 22, 2017 at 10:42 AM, Vadim Raskin <[email protected]>
wrote:

> Thanks for the feedback.
>
> I'm ok with keeping local bookkeeping as a default for a while.
>
> Regarding the "edge case": what I meant is that, based on the tests I've
> run, it is not an issue to add the same node to the cluster under a
> different IP during the outage. I did NOT mean that deployment models
> without static IPs are the "edge case".
>
> Regards, Vadim.
>
> On Fri, 22 Sep 2017 at 18:53 Tyson Norris <[email protected]>
> wrote:
>
> > Thanks Vadim!
> >
> > A couple comments:
> > - just to be clear: this is leveraging Akka Clustering (not just Akka
> > Remoting)
> > - I’m interested to hear if "deployment models where controller
> > container’s IP changes upon the restart" is actually an edge case (it is
> > not for us)
> > - I’m not an Akka or Akka Cluster expert, but we’ve been testing Akka
> > clustering (separate from OW) and had problems in these cases due to
> > dynamic IPs, where it has required logic to explicitly down the nodes to
> > return to normal operation after a failure (would like to hear from any
> > Akka/Cluster experts on this topic!)
> >
> > IMHO, this is often NOT an edge case, and as such, until the impl is more
> > flexible (in how seed nodes are defined and downing is handled), the
> > default should be to NOT enable this.
> >
> > For example, in Mesos, we cannot predict the IP address of the
> > controller at restart, so this will lead to an unreachable-nodes list
> > that is never cleared without manual intervention.
> >
> > I mentioned this would be OK (as a first step, to require manual
> > intervention), but I think the default should be to disable this
> > clustering until it can be handled for various deployment scenarios. In
> > the meantime, if people do want to enable this for the "dynamic IP"
> > scenario, there needs to be documentation indicating exactly what steps
> > need to be taken to handle downing, and what the risks are of NOT doing
> > this.
> >
> > Of course this could be seen as "just a matter of defaults", so it's not
> > technically a big difference to enable it by default (vs. disabled), but I
> > would err on the side that will produce the best results for more
> > operators.
> >
> > WDYT?
> >
> > Thanks
> > Tyson
> >
> > > On Sep 22, 2017, at 9:00 AM, Vadim Raskin <[email protected]> wrote:
> > >
> > > Hi everyone,
> > > (sorry if dup, had some issues with mail delivery)
> > >
> > > I just wanted to give a short introduction to a piece of work that is
> > > currently ongoing in the area of controller scale-out. To enable
> > > several active controller instances to run simultaneously, we introduce
> > > controller clustering, whose main purpose is to share the controller
> > > bookkeeping information, e.g. activations per invoker and activations
> > > per namespace. Under the hood we use Akka Remoting, which showed good
> > > behaviour with no regression in our test environments. The introduction
> > > of this feature alone should not change the external behaviour of
> > > controllers unless routing to more than one controller is explicitly
> > > enabled.
> > >
> > > The next recommended steps after clustering goes into master:
> > > - keep two controllers deployed as before in active-passive mode with
> > > clustering enabled, and let the controllers replicate their data while
> > > collecting operational experience.
> > > - scale out the number of controller nodes and enable active-active
> > > mode in the upfront load balancer.
> > >
> > > A couple of things to keep in mind:
> > > * this change comes with a feature toggle, which means you can easily
> > > turn off clustering by setting controllerLocalBookkeeping in your
> > > deployment. This is more appropriate for the first phase, when only one
> > > controller is active.
> > > * there could be certain edge cases where clustering requires special
> > > treatment, namely deployment models where the controller container's
> > > IP changes upon restart. Say one controller has failed and rejoined
> > > the cluster as a new member: some garbage will accumulate in the
> > > list of cluster members. It is not harmful per se, i.e. the cluster is
> > > still running, but healthy cluster nodes will keep gossiping with a
> > > non-existent container. If assigning static IP addresses is not an
> > > option, one could avoid this by using the auto-downing feature in Akka
> > > Cluster, which allows the cluster leader to mark the failing node as
> > > down and remove it from the cluster. To prevent cluster partitioning due
> > > to several leaders, this property must be set to a relatively high
> > > value. There is no single correct number; it could be defined based on
> > > further ops experience.
> > >
> > > If you have any feedback regarding this change, you could respond in
> > > this thread, ping me on Slack, or comment on this PR:
> > > https://github.com/apache/incubator-openwhisk/pull/2531
> > >
> > > regards, Vadim Raskin.
> >
> >
>
