Hello Jem,
We looked deeper into this and it seems that it’s working as mandated
by the current design (I’ll explain in detail below), and that there is
also a way of forcing your desired behaviour (which totally makes sense
in some scenarios).

Analysis:
First let’s dissect your log and see what’s happening:

Note1: Seed nodes are nothing very magical. It’s only a list of nodes
that a joining node will try to contact when trying to join a cluster.
Note2: Joining “self” is normal and expected.

Ok, so let’s look at the above logs and write up what’s happening:

// seed nodes = [51, 52]
// other node = [39]

> 51 starts; 52 not started yet, 39 not started yet
> 51 joins self, this is fine. This is the beginning of clusterA.
> 39 starts
> 39 contacts 51, joins its cluster
> cluster singleton started on 39 or 51
> 52 starts
> 51 stops
// 52 never talked to 51 at this point (that's the root of the
problem!); it didn't make it in time before 51 died
|| if singleton was running on 51 the manager notices this, and it
will start it on 39
|| if singleton was running on 39, it stays there
> 52 tries to join the cluster; seed nodes are 51, 52; 51 just died
> 52 joins self, this is the beginning of clusterB! A new cluster has emerged.
> 52 has no idea about 39. No one told it to contact 39, so it won't
> (we do not have magical auto-discovery).
> 52 starts the singleton (!).
// the singleton is running twice among our apps, but not "twice in
the same cluster" - because 52 has no way of knowing that there is
some 39 node running "somewhere".
> 51 comes back up, it has 52 in seed nodes, so it will join it;
> 51 notices that 52 has the singleton, and will not do anything to it.

Ok… So we know why this happens. Is this “valid” behaviour? Well… It’s
“expected” - effectively it shows that two clusters have arisen, not one.

Then, the seed nodes never had the chance to tell each other about “that
new guy” (39) who joined, so its address is unknown to 52 - which
therefore creates a new cluster, which the restarted 51 instance then
joins => a completely separate cluster.

I could just say “this is fine”, of course, and for some applications it
might be. But I definitely see good use cases for really guaranteeing a
single instance of the singleton.

Suggestions:
Here are a few ways to increase its resilience:

1) We can *leverage roles* in order to keep the cluster singleton from
starting until more seed nodes know about each other.
This way we do not lose the information about the 39 node if 51 goes
down, because 52 will also be aware of it.

Basically, the idea here is that “there must always be at least one seed
node that knows about the singleton”.
This way you can increase the resilience of the system (how strong a
guarantee we get about the singleton not suddenly becoming a doubleton
;-)) by increasing the number of seed nodes.
Graphically speaking: A B C X Y Z, where A B C are seed nodes and X Y Z
joined later, means that we can afford to lose 2 of A B C at the same
time; the remaining one keeps track of the singleton running on the
X Y Z nodes, so even when B and C re-join (as new instances of the apps),
they will learn that the singleton is already running on “some node
called X”, whose address the rejoining nodes would otherwise not know
(which is exactly what caused the problem in the example above).

Code wise, it’s very simple to implement, and I’ve prepared a pull request
with a sample for you:
https://github.com/Synesso/scratch-akka-cluster-singleton/pull/1/files
We just mark all seed nodes with a special seed role. This means that we
won’t start the singleton until the seed nodes have been contacted. By
increasing their number you get more resilience against failures (and
against ending up with a doubleton when restarting these services,
because they will not form a new cluster, but re-join the “last man
standing” seed node).
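
To make this concrete, here is a rough sketch of the idea (the actor
system name, the MySingleton actor and the exact wiring are placeholders
of mine; the real, runnable sample is in the PR linked above):

import akka.actor.{ Actor, ActorSystem, PoisonPill, Props }
import akka.cluster.Cluster
import akka.contrib.pattern.ClusterSingletonManager
import com.typesafe.config.ConfigFactory

class MySingleton extends Actor {
  def receive = Actor.emptyBehavior // stand-in for your real singleton actor
}

object SeedNode extends App {
  // On the seed nodes: mark them with the "seed" role and require that
  // both of them have joined before any member is moved to Up.
  val config = ConfigFactory.parseString("""
    akka.cluster.roles = ["seed"]
    akka.cluster.role.seed.min-nr-of-members = 2
    """).withFallback(ConfigFactory.load())

  val system = ActorSystem("ClusterSystem", config)

  // registerOnMemberUp fires only once this member reaches Up, which the
  // setting above delays until the seed nodes have seen each other - so a
  // freshly restarted, "lonely" seed node cannot start a second singleton.
  Cluster(system).registerOnMemberUp {
    system.actorOf(
      ClusterSingletonManager.props(
        singletonProps = Props[MySingleton],
        singletonName = "my-singleton",
        terminationMessage = PoisonPill,
        role = None),
      name = "singleton-manager")
  }
}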

2) You could try to stop using seed-nodes, because they’re static, and
thus… tricky.

And instead use a “global” service registry, where each ActorSystem would
register itself when running.
Then, when joining the cluster, you’d ask that service “hey, who is
online now?”. The difference from seed nodes is that these initial
contact points can be updated, whereas seed-nodes are hardcoded in the
config.
I’ve implemented such systems using ZooKeeper in the past. You would have
paths like /akka/clusters/banana-cluster/node-*, and do an “ls” on the
parent directory to find out about existing nodes (and their addresses)…
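
Just to illustrate the idea (this is not something Akka provides; the
ZooKeeper layout, the system name and the convention of storing each
node’s address as the znode data are all assumptions of this sketch):

import scala.collection.JavaConverters._
import akka.actor.{ ActorSystem, AddressFromURIString }
import akka.cluster.Cluster
import org.apache.zookeeper.{ WatchedEvent, Watcher, ZooKeeper }

object ZkBasedJoin extends App {
  val system = ActorSystem("ClusterSystem")

  val zk = new ZooKeeper("zookeeper-host:2181", 5000, new Watcher {
    def process(event: WatchedEvent): Unit = () // events ignored in this sketch
  })

  val parent = "/akka/clusters/banana-cluster"

  // the "ls" on the parent directory: every child znode is one registered
  // node, assumed to hold its address ("akka.tcp://ClusterSystem@host:2551")
  // as its data
  val onlineNodes = zk.getChildren(parent, false).asScala.toList.map { child =>
    AddressFromURIString(
      new String(zk.getData(s"$parent/$child", false, null), "UTF-8"))
  }

  // join whoever is online right now, instead of a hardcoded seed-nodes list;
  // the very first node to come up finds nobody and simply joins itself
  if (onlineNodes.nonEmpty) Cluster(system).joinSeedNodes(onlineNodes)
  else Cluster(system).join(Cluster(system).selfAddress)

  // ...and afterwards this node would register itself under `parent` as an
  // ephemeral znode, so the entry disappears when the app dies
}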

This is probably a good idea if you really need to be sure about everything
in your cluster.
We currently do not provide cluster auto-discovery which would solve this
for you “magically” :-)

I hope this makes sense! Please let me know if more explanation is required.
I have also opened an issue (
https://www.assembla.com/spaces/akka/simple_planner#/ticket:3986 ) around
this and will improve the docs to include these patterns.
Not sure how much we can “automagically guarantee” in the future here - we
would need to implement cluster discovery (not sure if it’s in the road
map, will check).

Note3: For the suggested solution (above), please downgrade to Akka 2.3.0.
We have introduced (and already fixed) a bug in the cluster in 2.3.1 which
prevents the suggested solution from working (nodes won’t join).

// Whew, quite long email!

-- 
Cheers,
Konrad 'ktoso' Malawski
hAkker - Typesafe, Inc
