[akka-user] Sharding coordinator comms issues running on Kubernetes

Stephen Kennedy Tue, 01 Aug 2017 06:11:25 -0700

Hi,

Seeing some strange issues running Cluster Sharded actors in a kubernetes 
environment.


We are currently running a 3 node akka cluster, with our app running within 
a docker container within a kubernetes stateful set (similar to the 
akka-seed set described here 
<https://medium.com/google-cloud/clustering-akka-in-kubernetes-with-statefulset-and-deployment-459c0e05f2ea>).
 
The nodes are all akka-http API servers, which run behind a HTTP load 
balancer, and use akka-persistence for our domain entities. So we use 
cluster sharding to ensure that each entity can only live on a single node 
at once.

So the cluster config of our akka nodes looks something like this:

akka {
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = ${POD_NAME}.api
      port = 2551
    }
  }

  cluster {
    seed-nodes = [
      "akka.tcp://actor-sys...@api-0.api:2551",
      "akka.tcp://actor-sys...@api-1.api:2551",
      "akka.tcp://actor-sys...@api-2.api:2551"
    ]
  }
}


Where "api" is the kubernetes service (which provides DNS mapping) and 
"api-0/1/2" are the consistent pod names that using a stateful set gives 
us. Using the default sharding config.

And within the code we have a number of calls to ClusterSharding.start - 
for each type of our sharded entity actors. We then only fire a message to 
these actors when an appropriate API call comes in.

Now when the nodes come up they all consistently connect to the cluster 
properly, and I see gossip messages suggesting they all know about each 
other, but we are then seeing communication issues on the Cluster Sharded 
actors.

As far as I can tell, if the first API request for a particular type of 
entity comes into api-1 or api-2, it often fails because that node is 
unable to communicate with the coordinator - which it seems to think is on 
api-0.

>From api-1:
2017-08-01 12:22:15.253 DEBUG akka.actor.ActorSystemImpl - 
http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:23:05.336 WARN  akka.cluster.sharding.ShardRegion - Trying to 
register to coordinator at 
[Some(ActorSelection[Anchor(akka.tcp://actor-sys...@api-0.api:2551/), 
Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], 
but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to 
register to coordinator at 
[Some(ActorSelection[Anchor(akka.tcp://actor-sys...@api-0.api:2551/), 
Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], 
but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:09.336 WARN  akka.cluster.sharding.ShardRegion - Trying to 
register to coordinator at 
[Some(ActorSelection[Anchor(akka.tcp://actor-sys...@api-0.api:2551/), 
Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], 
but no acknowledgement. Total [10] buffered messages.
akka.pattern.AskTimeoutException: Ask timed out on 
[Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]] after 
[15000 ms]. Sender[null] sent message of type 
"com.goodlord.server.domain.envelope.EnvelopeAggregate$Protocol$Get".
2017-08-01 12:23:10.344 DEBUG akka.actor.ActorSystemImpl - 
http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - 500 Internal Server Error

The trying to register log lines are then repeated forever and if more API 
calls come into this node (for this shard) the buffered count just goes up. 
Logging into the 

However, if we subsequently hit the API and the load balancer chooses 
api-0, then it works and also seems to initialize the coordinator (which 
then triggers api-1 to register which clears it's backlog):

>From api-0:
2017-08-01 12:25:05.902 DEBUG akka.actor.ActorSystemImpl - 
http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:25:05.932 INFO  a.c.s.ClusterSingletonManager - Singleton 
manager starting singleton actor 
[akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton]
2017-08-01 12:25:05.933 DEBUG akka.cluster.ddata.Replicator - Received Get 
for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.338 DEBUG a.c.sharding.DDataShardCoordinator - 
ShardRegion registered: 
[Actor[akka.tcp://actor-sys...@api-1.api:2551/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.339 DEBUG akka.cluster.ddata.Replicator - Received 
Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.341 DEBUG a.c.sharding.DDataShardCoordinator - The 
coordinator state was successfully updated with 
ShardRegionRegistered(Actor[akka.tcp://actor-sys...@api-1.api:2551/system/sharding/EnvelopeShard#92546893])
2017-08-01 12:25:07.341 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: 
[akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton/coordinator
 
-> akka.tcp://actor-sys...@api-1.api:2551/system/sharding/EnvelopeShard]
2017-08-01 12:25:07.345 DEBUG akka.cluster.ddata.Replicator - Received 
Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.348 DEBUG a.c.sharding.DDataShardCoordinator - The 
coordinator state was successfully updated with 
ShardHomeAllocated(22,Actor[akka.tcp://actor-sys...@api-1.api:2551/system/sharding/EnvelopeShard#92546893])

>From api-1:

2017-08-01 12:25:07.336 WARN  akka.cluster.sharding.ShardRegion - Trying to 
register to coordinator at 
[Some(ActorSelection[Anchor(akka.tcp://actor-sys...@api-0.api:2551/), 
Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], 
but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:25:07.342 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: 
[akka://actor-system/system/sharding/EnvelopeShard -> 
akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator]
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry 
request for shard [22] homes from coordinator at 
[Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]].
 
[1] buffered messages.
2017-08-01 12:25:07.342 WARN  akka.cluster.sharding.ShardRegion - Retry 
request for shard [13] homes from coordinator at 
[Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]].
 
[1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry 
request for shard [16] homes from coordinator at 
[Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]].
 
[1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry 
request for shard [6] homes from coordinator at 
[Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]].
 
[2] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry 
request for shard [29] homes from coordinator at 
[Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]].
 
[1] buffered messages.
2017-08-01 12:25:07.343 WARN  akka.cluster.sharding.ShardRegion - Retry 
request for shard [20] homes from coordinator at 
[Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]].
 
[1] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry 
request for shard [21] homes from coordinator at 
[Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]].
 
[2] buffered messages.
2017-08-01 12:25:07.344 WARN  akka.cluster.sharding.ShardRegion - Retry 
request for shard [10] homes from coordinator at 
[Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]].
 
[1] buffered messages.
2017-08-01 12:25:07.353 DEBUG akka.cluster.sharding.ShardRegion - Shard 
[22] located at 
[Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.354 DEBUG akka.cluster.sharding.ShardRegion - Shard 
[13] located at 
[Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.362 DEBUG akka.cluster.sharding.ShardRegion - Shard 
[16] located at 
[Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]

So it seems to me that the initial call to ClusterSharding.start doesn't 
actually do the registration of all shard regions on the cluster and that 
it is only when we subsequently fire a message to the shard region that the 
initialization occurs. 

Is this expected behaviour? Or have we got something wrong with the 
initialization of our app? 

Sure enough if I hack together some code to fire a random message to the 
shard on startup then all our problems go away.

Thanks in advance for any help here - don't have a great deal of experience 
with akka clustering so I'm sure we've done something stupid somewhere.

Cheers!
Stephen

-- 
>>>>>>>>>>      Read the docs: http://akka.io/docs/
>>>>>>>>>>      Check the FAQ: 
>>>>>>>>>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>>      Search the archives: https://groups.google.com/group/akka-user
--- 
You received this message because you are subscribed to the Google Groups "Akka 
User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to akka-user+unsubscr...@googlegroups.com.
To post to this group, send email to akka-user@googlegroups.com.
Visit this group at https://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.

[akka-user] Sharding coordinator comms issues running on Kubernetes

Reply via email to