Hi,

I'm seeing some strange issues running cluster sharded actors in a Kubernetes environment.
We are currently running a 3-node Akka cluster, with our app running in a Docker container within a Kubernetes StatefulSet (similar to the akka-seed set described here <https://medium.com/google-cloud/clustering-akka-in-kubernetes-with-statefulset-and-deployment-459c0e05f2ea>). The nodes are all akka-http API servers, which run behind an HTTP load balancer and use akka-persistence for our domain entities, so we use cluster sharding to ensure that each entity can only live on a single node at once.

The cluster config of our Akka nodes looks something like this:

    akka {
      remote {
        enabled-transports = ["akka.remote.netty.tcp"]
        netty.tcp {
          hostname = ${POD_NAME}.api
          port = 2551
        }
      }
      cluster {
        seed-nodes = [
          "akka.tcp://actor-sys...@api-0.api:2551",
          "akka.tcp://actor-sys...@api-1.api:2551",
          "akka.tcp://actor-sys...@api-2.api:2551"
        ]
      }
    }

Here "api" is the Kubernetes service (which provides the DNS mapping) and "api-0/1/2" are the consistent pod names that using a StatefulSet gives us. We are using the default sharding config.

Within the code we have a number of calls to ClusterSharding.start, one for each type of sharded entity actor. We then only send a message to these actors when an appropriate API call comes in.

When the nodes come up they all consistently connect to the cluster properly, and I see gossip messages suggesting they all know about each other, but we then see communication issues with the cluster sharded actors. As far as I can tell, if the first API request for a particular type of entity comes into api-1 or api-2, it often fails because that node is unable to communicate with the coordinator, which it seems to think is on api-0.

From api-1:

2017-08-01 12:22:15.253 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:23:05.336 WARN akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-sys...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:07.336 WARN akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-sys...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:23:09.336 WARN akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-sys...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]] after [15000 ms]. Sender[null] sent message of type "com.goodlord.server.domain.envelope.EnvelopeAggregate$Protocol$Get".
2017-08-01 12:23:10.344 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - 500 Internal Server Error

The "Trying to register" log lines are then repeated forever, and if more API calls come into this node (for this shard) the buffered count just goes up.

However, if we subsequently hit the API and the load balancer chooses api-0, then it works and also seems to initialize the coordinator (which then triggers api-1 to register, which clears its backlog):

From api-0:

2017-08-01 12:25:05.902 DEBUG akka.actor.ActorSystemImpl - http://xxxxxxxxxxx/envelopes - HttpMethod(GET) - Starting
2017-08-01 12:25:05.932 INFO a.c.s.ClusterSingletonManager - Singleton manager starting singleton actor [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton]
2017-08-01 12:25:05.933 DEBUG akka.cluster.ddata.Replicator - Received Get for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.338 DEBUG a.c.sharding.DDataShardCoordinator - ShardRegion registered: [Actor[akka.tcp://actor-sys...@api-1.api:2551/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.339 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.341 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardRegionRegistered(Actor[akka.tcp://actor-sys...@api-1.api:2551/system/sharding/EnvelopeShard#92546893])
2017-08-01 12:25:07.341 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShardCoordinator/singleton/coordinator -> akka.tcp://actor-sys...@api-1.api:2551/system/sharding/EnvelopeShard]
2017-08-01 12:25:07.345 DEBUG akka.cluster.ddata.Replicator - Received Update for key [EnvelopeShardCoordinatorState]
2017-08-01 12:25:07.348 DEBUG a.c.sharding.DDataShardCoordinator - The coordinator state was successfully updated with ShardHomeAllocated(22,Actor[akka.tcp://actor-sys...@api-1.api:2551/system/sharding/EnvelopeShard#92546893])

From api-1:

2017-08-01 12:25:07.336 WARN akka.cluster.sharding.ShardRegion - Trying to register to coordinator at [Some(ActorSelection[Anchor(akka.tcp://actor-sys...@api-0.api:2551/), Path(/system/sharding/EnvelopeShardCoordinator/singleton/coordinator)])], but no acknowledgement. Total [10] buffered messages.
2017-08-01 12:25:07.342 DEBUG akka.cluster.ClusterRemoteWatcher - Watching: [akka://actor-system/system/sharding/EnvelopeShard -> akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator]
2017-08-01 12:25:07.342 WARN akka.cluster.sharding.ShardRegion - Retry request for shard [22] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.342 WARN akka.cluster.sharding.ShardRegion - Retry request for shard [13] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN akka.cluster.sharding.ShardRegion - Retry request for shard [16] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN akka.cluster.sharding.ShardRegion - Retry request for shard [6] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.343 WARN akka.cluster.sharding.ShardRegion - Retry request for shard [29] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.343 WARN akka.cluster.sharding.ShardRegion - Retry request for shard [20] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.344 WARN akka.cluster.sharding.ShardRegion - Retry request for shard [21] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [2] buffered messages.
2017-08-01 12:25:07.344 WARN akka.cluster.sharding.ShardRegion - Retry request for shard [10] homes from coordinator at [Actor[akka.tcp://actor-sys...@api-0.api:2551/system/sharding/EnvelopeShardCoordinator/singleton/coordinator#1730016074]]. [1] buffered messages.
2017-08-01 12:25:07.353 DEBUG akka.cluster.sharding.ShardRegion - Shard [22] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.354 DEBUG akka.cluster.sharding.ShardRegion - Shard [13] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]
2017-08-01 12:25:07.362 DEBUG akka.cluster.sharding.ShardRegion - Shard [16] located at [Actor[akka://actor-system/system/sharding/EnvelopeShard#92546893]]

So it seems to me that the initial call to ClusterSharding.start doesn't actually register all the shard regions with the cluster, and that the initialization only occurs when we subsequently send a message to the shard region. Is this expected behaviour? Or have we got something wrong with the initialization of our app?

Sure enough, if I hack together some code to fire a random message to the shard on startup, then all our problems go away.

Thanks in advance for any help here. I don't have a great deal of experience with Akka clustering, so I'm sure we've done something stupid somewhere.

Cheers!
Stephen
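
P.S. In case it helps anyone reproduce this, the startup workaround mentioned above looks roughly like the sketch below. It is a minimal illustration rather than our actual code: the `Envelope` protocol, the `Ping` warm-up message, the `EnvelopeEntity` actor and the shard count of 30 are all made up for the example, and it assumes the classic `ClusterSharding(system).start` API from Akka 2.5 with the cluster config shown earlier on the classpath.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings, ShardRegion}

// Hypothetical protocol, standing in for the real aggregate's messages.
object Envelope {
  final case class Get(id: String)
  // Hypothetical no-op message, used only to wake the shard region at startup.
  final case class Ping(id: String)

  // Assumption: the real shard count is not stated in this post.
  val numberOfShards = 30

  private def shardOf(id: String): String =
    math.abs(id.hashCode % numberOfShards).toString

  // Maps each message to (entity id, message delivered to the entity).
  val extractEntityId: ShardRegion.ExtractEntityId = {
    case msg @ Get(id)  => (id, msg)
    case msg @ Ping(id) => (id, msg)
  }

  // Maps each message to the id of the shard that owns the entity.
  val extractShardId: ShardRegion.ExtractShardId = {
    case Get(id)  => shardOf(id)
    case Ping(id) => shardOf(id)
  }
}

// Hypothetical entity actor; the real one is a persistent aggregate.
class EnvelopeEntity extends Actor {
  def receive: Receive = {
    case Envelope.Ping(_) => // nothing to do, the message only forces startup
    case Envelope.Get(id) => sender() ! s"envelope-$id"
  }
}

object Main extends App {
  val system = ActorSystem("actor-system")

  // Start the shard region eagerly when the node boots.
  val region: ActorRef = ClusterSharding(system).start(
    typeName = "EnvelopeShard",
    entityProps = Props(classOf[EnvelopeEntity]),
    settings = ClusterShardingSettings(system),
    extractEntityId = Envelope.extractEntityId,
    extractShardId = Envelope.extractShardId
  )

  // The workaround: send one throwaway message on every node at startup.
  region ! Envelope.Ping("warm-up")
}
```

With something like this running on every node at boot, the first real API request no longer sits in the region's buffer. I still don't know whether needing it points to a mistake in how we initialize sharding.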