Re: [akka-user] Cluster Singleton duplicated if primary seed restarted

Jem Tue, 08 Apr 2014 22:47:31 -0700

Thank you Konrad. I'm really impressed by the depth of investigation and
your explanation.



On 8 April 2014 23:19, Konrad Malawski <konrad.malaw...@typesafe.com> wrote:

> Hello Jem,
> We looked deeper into this and it seems that it's both working as mandated
> by the current design (I'll explain in detail below), and that there is
> a way of forcing your desired behaviour (which totally makes sense in some
> scenarios).
>
> Analysis:
> First let's dissect your log and see what's happening:
>
> Note1: Seed nodes are nothing very magical. They're only a list of nodes that
> a joining node will try to talk to when trying to join a cluster.
> Note2: Joining "self" is normal and expected.
>
> Ok, so let's look at the above logs and write up what's happening:
>
> // seed nodes = [51, 52]
> // other node = [39]
>
> > 51 starts; 52 not started yet, 39 not started yet
> > 51 joins self, this is fine. This is the beginning of clusterA.
> > 39 starts
> > 39 contacts 51, joins its cluster
> > cluster singleton started on 39 or 51
> > 52 starts
> > 51 stops
> // 52 never talked to 51 at this point (that's the root of the problem!), it
> // didn't make it in time before 51 died
> || if singleton was running on 51 the manager notices this, and it will start
> || it on 39
> || if singleton was running on 39, it stays there
> > 52 tries to join the cluster; seed nodes are 51, 52; 51 just died
> > 52 joins self, this is the beginning of clusterB! A new cluster has emerged.
> > 52 has no idea about 39. No one told it to contact 39, so it won't. (We do
> > not have magical auto-discovery)
> > 52 starts the singleton (!).
> // the singleton is running twice among our apps, but not "twice in the same
> // cluster" - because 52 has no way of knowing that there is some 39 node
> // running "somewhere".
> > 51 comes back up, it has 52 in seed nodes, so it will join it;
> > 51 notices that 52 has the singleton, and will not do anything to it.
>
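
To make the sequence above concrete, here is a toy model of seed-node joining. It is not Akka code: `Node`, `start` and `stop` are hypothetical names invented for this sketch, and it simplifies step "52 starts" by assuming 52's first join attempt only happens after 51 has died.

```python
# Toy model of seed-node joining; not Akka code. Node names follow the log above.
SEED_NODES = ["51", "52"]


class Node:
    def __init__(self, name):
        self.name = name
        self.running = False
        self.cluster = None  # set of member names, shared by all members


def start(node, nodes):
    """Start a node: join the first reachable seed node, else join itself."""
    node.running = True
    for seed in SEED_NODES:
        peer = nodes[seed]
        if peer is not node and peer.running and peer.cluster is not None:
            node.cluster = peer.cluster
            node.cluster.add(node.name)
            return
    node.cluster = {node.name}  # no seed answered: a brand-new cluster


def stop(node):
    """Stop a node and remove it from its cluster."""
    node.running = False
    node.cluster.discard(node.name)
    node.cluster = None


nodes = {name: Node(name) for name in ["51", "52", "39"]}

start(nodes["51"], nodes)  # 51 joins self: cluster A
start(nodes["39"], nodes)  # 39 contacts 51 and joins cluster A
stop(nodes["51"])          # 51 dies before 52 has talked to it
start(nodes["52"], nodes)  # no seed answers, 52 joins self: cluster B
start(nodes["51"], nodes)  # restarted 51 finds 52 and joins cluster B

cluster_a = nodes["39"].cluster
cluster_b = nodes["52"].cluster
print(sorted(cluster_a), sorted(cluster_b))  # ['39'] ['51', '52']

# One singleton per disjoint cluster, so two singletons overall.
clusters = {id(n.cluster): n.cluster for n in nodes.values() if n.running}
print(len(clusters))  # 2
```

The point of the sketch is only that 39 is never in the seed list, so nothing leads 52 or the restarted 51 back to it.
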
> Ok... So we know why this happens. Is this "valid" behaviour? Well... It's
> "expected" - effectively this shows that two clusters have formed, not one.
>
> The seed nodes never had the chance to talk to each other about
> "that new guy" (39) who joined, so its address is unknown to 52. 52
> therefore creates a new cluster, which the new 51 instance then joins,
> leaving 39 behind in the old one.
>
> I could just say "this is fine" of course, and for some applications it
> might be. But I definitely see good use cases for really guaranteeing this
> singleton instance.
>
> Suggestions:
> Here are a few ways to increase its resilience:
>
> 1) We can *leverage roles* in order to keep the cluster singleton from
> starting until more seed nodes know about each other.
> This allows us to not lose information about the 39 node if 51 goes down,
> because 52 will also be aware of it.
>
> Basically the idea here is that "there must always be at least one seed
> node that knows about the singletons".
> This way you can increase the resilience of the system (how strong a
> guarantee we get about the singleton not suddenly becoming a doubleton
> ;-)), by increasing the number of seed nodes.
> Graphically speaking: A B C X Y Z, where ABC are seed nodes and X Y Z
> joined later, means that we can afford to lose 2 of ABC at the same time,
> and the remaining one will keep track of the singletons replicated to the
> X Y Z nodes, so even when B and C re-join (new instances of apps), they
> will get the information that the singleton is running already on "some
> node called X", of which otherwise the rejoining nodes would not know the
> addresses (and would cause the problem as in the above example).
>
> Code wise, it's very simple to implement, and I've prepared a pull request
> with a sample for you:
> https://github.com/Synesso/scratch-akka-cluster-singleton/pull/1/files
> We just mark all seed nodes with a special "seed" role. This means that we
> won't start the cluster until seed nodes have been contacted. By
> increasing their number you get more resilience against failures (and
> against getting a doubleton when restarting these services, because they
> will not form a new cluster, but re-join the "last man standing" seed node).
>
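
For readers who cannot open the pull request, a configuration along these lines is one way to express suggestion 1. This is a sketch, not a copy of the PR: the system name `ClusterSystem`, the host names and the threshold of 2 are placeholders chosen for illustration. The settings themselves (`akka.cluster.roles` and `akka.cluster.role.<role-name>.min-nr-of-members`) are standard Akka cluster configuration.

```
# application.conf (sketch; system name, hosts and threshold are placeholders)
akka.cluster {
  seed-nodes = [
    "akka.tcp://ClusterSystem@host51:2552",
    "akka.tcp://ClusterSystem@host52:2552"
  ]

  # Set this on the seed nodes only.
  roles = ["seed"]

  # Keep this block identical on every node: joining members are not moved
  # to Up until at least 2 members with the "seed" role are present.
  role {
    seed.min-nr-of-members = 2
  }
}
```

Since the singleton is only started among members that are Up, it should not appear until both seed nodes have seen each other.
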
> 2) You could try to stop using seed-nodes, because they're static, and
> thus... tricky.
>
> And instead use a "global" service registry, where each ActorSystem would
> register itself when running.
> Then when joining the cluster, you'd ask that service "hey, who is online
> now?". The difference from seed nodes here is that the initial contact
> points can be updated, and seed-nodes are hardcoded in the config.
> I've implemented such systems using ZooKeeper in the past. You would have
> paths like /akka/clusters/banana-cluster/node-*, and do an "ls" on the
> parent directory, to find out about existing nodes (and their addresses)...
>
> This is probably a good idea if you really need to be sure about
> everything in your cluster.
> We currently do not provide cluster auto-discovery which would solve this
> for you "magically" :-)
>
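
The registry idea in suggestion 2 can be sketched without a real ZooKeeper. In the toy below a dict stands in for the store; `Registry`, `contact_points` and the host names are hypothetical, and real code would also have to deal with sessions, races between "ls" and "register", and so on.

```python
import itertools

PARENT = "/akka/clusters/banana-cluster"


class Registry:
    """In-memory stand-in for a service registry such as ZooKeeper."""

    def __init__(self):
        self._children = {}  # parent path -> {child name: node address}
        self._seq = itertools.count()

    def register(self, parent, address):
        """Add a sequentially named child under parent, like a node-* entry."""
        name = "node-%010d" % next(self._seq)
        self._children.setdefault(parent, {})[name] = address
        return name

    def unregister(self, parent, name):
        """Remove an entry, as would happen when a node goes away."""
        self._children.get(parent, {}).pop(name, None)

    def ls(self, parent):
        """List the addresses currently registered under parent."""
        return sorted(self._children.get(parent, {}).values())


def contact_points(registry, own_address):
    """Ask "who is online now?", then register ourselves."""
    peers = [a for a in registry.ls(PARENT) if a != own_address]
    entry = registry.register(PARENT, own_address)
    return peers, entry


registry = Registry()
_, entry_51 = contact_points(registry, "host51:2552")
peers_39, _ = contact_points(registry, "host39:2552")
registry.unregister(PARENT, entry_51)  # 51 dies
peers_52, _ = contact_points(registry, "host52:2552")
print(peers_52)  # ['host39:2552']
```

Unlike the static seed list, the lookup hands 52 the address of 39, so 52 has someone to join instead of forming a second cluster.
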
> I hope this makes sense! Please let me know if more explanation is
> required.
> I have also opened an issue (
> https://www.assembla.com/spaces/akka/simple_planner#/ticket:3986 ) around
> this and will improve the docs to include these patterns.
> Not sure how much we can "automagically guarantee" in the future here - we
> would need to implement cluster discovery (not sure if it's in the road
> map, will check).
>
> Note3: For the suggested solution (above), please downgrade to Akka 2.3.0.
> We have introduced (and already fixed) a bug in the cluster in 2.3.1 which
> prevents the suggested solution from working (nodes won't join).
>
> // Whew, quite long email!
>
> --
> Cheers,
> Konrad 'ktoso' Malawski
> hAkker - Typesafe, Inc
>
> <http://www.scaladays.org/>
>
> --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ:
> http://doc.akka.io/docs/akka/current/additional/faq.html
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to a topic in the
> Google Groups "Akka User List" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/akka-user/ns7DPHGYbIk/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> akka-user+unsubscr...@googlegroups.com.
>
> To post to this group, send email to akka-user@googlegroups.com.
> Visit this group at http://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/d/optout.
>

