Our akka-cluster-sharding service went down last night. While processing akka-http requests on a 10-node cluster (each request is forwarded to a shard region for processing), one of the requests failed with an ask-timeout exception:
[ERROR] [08/05/2016 05:04:51.077] [ClusterSystem-akka.actor.default-dispatcher-16] [akka.actor.ActorSystemImpl(ClusterSystem)] Error during processing of request HttpRequest(HttpMethod(GET),http://eim.staging.example.com/eim/check/a0afbad4-69a8-4487-a6fb-f3e884a8d0aa?cache=false&timeout=15,List(Host: eim.staging.example.com, X-Real-Ip: 10.0.3.9, X-Forwarded-For: 10.0.3.9, Connection: upgrade, Accept: */*, Accept-Encoding: gzip, deflate, compress, Authorization: Bearer aaa-1ZdrFpgR5AyOGa69Q2s3fwv_y5zz9UCL5F85Hc, User-Agent: python-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-74-generic, Timeout-Access: <function1>),HttpEntity.Strict(application/json,),HttpProtocol(HTTP/1.1))
akka.pattern.AskTimeoutException: Recipient[Actor[akka://ClusterSystem/system/sharding/ExperimentInstance#-1675878517]] had already been terminated. Sender[null] sent the message of type "com.genecloud.eim.ExperimentInstance$Commands$CheckExperiment".

As the error message says, the ask timed out because the recipient actor (the shard region?) had already been terminated.
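For reference, our request path looks roughly like the sketch below: an akka-http route `ask`s the shard region and maps the reply into the response. Everything here other than the Akka APIs (`CheckExperiment`, `CheckRoutes`, the path segments) is a simplified stand-in for our actual code, not the literal implementation:

```scala
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

import akka.actor.ActorRef
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route
import akka.pattern.ask
import akka.util.Timeout

// Simplified stand-in for ExperimentInstance$Commands$CheckExperiment
final case class CheckExperiment(id: String)

class CheckRoutes(shardRegion: ActorRef)(implicit ec: ExecutionContext) {

  // The ask timeout. When the shard region (or its coordinator) has been
  // terminated, the ask fails with AskTimeoutException and akka-http
  // surfaces that as a 500 -- the behavior we are seeing.
  implicit val askTimeout: Timeout = Timeout(15.seconds)

  val route: Route =
    path("eim" / "check" / Segment) { id =>
      get {
        complete((shardRegion ? CheckExperiment(id)).mapTo[String])
      }
    }
}
```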
Looking back in the logs, I see that everything was going well for quite some time, until the following:

[DEBUG] [08/05/2016 05:04:50.480] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Resolving login.dev.example.com before connecting
[DEBUG] [08/05/2016 05:04:50.480] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Attempting connection to [login.dev.example.com/52.14.30.100:443]
[DEBUG] [08/05/2016 05:04:50.481] [ClusterSystem-akka.actor.default-dispatcher-6] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Connection established to [login.dev.example.com:443]
[DEBUG] [08/05/2016 05:04:50.481] [ClusterSystem-akka.actor.default-dispatcher-5] [akka://ClusterSystem/system/IO-TCP] no longer watched by Actor[akka://ClusterSystem/user/StreamSupervisor-7/$$H#-1778344261]
[DEBUG] [08/05/2016 05:04:50.481] [ClusterSystem-akka.actor.default-dispatcher-5] [akka://ClusterSystem/system/IO-TCP/selectors/$a/955] now watched by Actor[akka://ClusterSystem/user/StreamSupervisor-7/$$H#-1778344261]
[DEBUG] [08/05/2016 05:04:50.483] [ClusterSystem-akka.actor.default-dispatcher-17] [akka://ClusterSystem/user/StreamSupervisor-7] now supervising Actor[akka://ClusterSystem/user/StreamSupervisor-7/flow-993-1-unknown-operation#819117275]
[DEBUG] [08/05/2016 05:04:50.484] [ClusterSystem-akka.actor.default-dispatcher-2] [akka://ClusterSystem/user/StreamSupervisor-7/flow-993-1-unknown-operation] started (akka.stream.impl.io.TLSActor@66dd942c)
[INFO] [08/05/2016 05:04:50.526] [ClusterSystem-akka.actor.default-dispatcher-17] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.0.3.103:2552] - Shutting down myself
[INFO] [08/05/2016 05:04:50.527] [ClusterSystem-akka.actor.default-dispatcher-17] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.0.3.103:2552] - Shutting down...
[DEBUG] [08/05/2016 05:04:50.528] [ClusterSystem-akka.actor.default-dispatcher-3] [akka://ClusterSystem/system/cluster/core] stopping
[DEBUG] [08/05/2016 05:04:50.528] [ClusterSystem-akka.actor.default-dispatcher-16] [akka://ClusterSystem/system/cluster/heartbeatReceiver] stopped
[DEBUG] [08/05/2016 05:04:50.539] [ClusterSystem-akka.actor.default-dispatcher-16] [akka://ClusterSystem/system/cluster/metrics] stopped
[DEBUG] [08/05/2016 05:04:50.540] [ClusterSystem-akka.actor.default-dispatcher-18] [akka://ClusterSystem/system/cluster] stopping
[INFO] [08/05/2016 05:04:50.573] [ClusterSystem-akka.actor.default-dispatcher-18] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstanceCoordinator] Self removed, stopping ClusterSingletonManager
[DEBUG] [08/05/2016 05:04:50.573] [ClusterSystem-akka.actor.default-dispatcher-18] [akka://ClusterSystem/system/sharding/ExperimentInstanceCoordinator] stopping

As you can see from the "Resolving..." and "Attempting connection" messages, an actor was happily sending off an HTTPS request to another microservice, but just after this point the cluster node announced "Shutting down myself". This killed the ClusterSingletonManager, and from that point on every incoming akka-http request that was forwarded to the shard region timed out for the same reason, so the service answered all subsequent requests to this node with HTTP status code 500. Note that the node itself was NOT down -- there are tons of messages after this point in the log. There is no code in our software that shuts down any cluster member. We were firing a lot of requests at the service when this happened, but average CPU utilization across the 10-node cluster never exceeded 30%. Why would the node decide to shut itself down?
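One thing we are double-checking is our downing configuration, since (as far as I understand it) a cluster member shuts itself down with exactly this "Shutting down myself" message when it sees its own member marked Down or Removed -- for example if automatic downing is enabled and the node was briefly considered unreachable. An illustrative application.conf fragment, with a hypothetical value, not our confirmed settings:

```hocon
akka.cluster {
  # If set, any member marked unreachable is automatically downed after
  # this interval; a downed member then terminates itself, logging
  # "Shutting down myself". Generally discouraged in production.
  auto-down-unreachable-after = 10s
}
```

Is a transient unreachability plus a setting like this a plausible explanation, or is there another mechanism that would make a node down itself?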
Any help would be appreciated. I'm concerned about the robustness of Akka in production, since we hope to scale well beyond 10 nodes.

-- Eric