Our akka-cluster-sharding service went down last night. While processing akka-http requests on a 10-node cluster (each request is forwarded to a shard region for processing), one of the requests failed with an ask-timeout exception:
[ERROR] [08/05/2016 05:04:51.077] [ClusterSystem-akka.actor.default-dispatcher-16] [akka.actor.ActorSystemImpl(ClusterSystem)] Error during processing of request HttpRequest(HttpMethod(GET),http://eim.staging.example.com/eim/check/a0afbad4-69a8-4487-a6fb-f3e884a8d0aa?cache=false&timeout=15,List(Host: eim.staging.example.com, X-Real-Ip: 10.0.3.9, X-Forwarded-For: 10.0.3.9, Connection: upgrade, Accept: */*, Accept-Encoding: gzip, deflate, compress, Authorization: Bearer aaa-1ZdrFpgR5AyOGa69Q2s3fwv_y5zz9UCL5F85Hc, User-Agent: python-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-74-generic, Timeout-Access: <function1>),HttpEntity.Strict(application/json,),HttpProtocol(HTTP/1.1))
akka.pattern.AskTimeoutException: Recipient[Actor[akka://ClusterSystem/system/sharding/ExperimentInstance#-1675878517]] had already been terminated. Sender[null] sent the message of type "com.genecloud.eim.ExperimentInstance$Commands$CheckExperiment".

As the error message says, the ask timed out because the recipient actor (the shard region?) had already been terminated.
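For reference, our request path looks roughly like the sketch below: an akka-http route `ask`s the shard region and maps the reply into the response. Everything here other than the Akka APIs (`CheckExperiment`, `CheckRoutes`, the path segments) is a simplified stand-in for our actual code, not the literal implementation:

```scala
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

import akka.actor.ActorRef
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route
import akka.pattern.ask
import akka.util.Timeout

// Simplified stand-in for ExperimentInstance$Commands$CheckExperiment
final case class CheckExperiment(id: String)

class CheckRoutes(shardRegion: ActorRef)(implicit ec: ExecutionContext) {

  // The ask timeout. When the shard region (or its coordinator) has been
  // terminated, the ask fails with AskTimeoutException and akka-http
  // surfaces that as a 500 -- the behavior we are seeing.
  implicit val askTimeout: Timeout = Timeout(15.seconds)

  val route: Route =
    path("eim" / "check" / Segment) { id =>
      get {
        complete((shardRegion ? CheckExperiment(id)).mapTo[String])
      }
    }
}
```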
Looking back in the logs, I see that everything was going well for quite some time, until the following:

[DEBUG] [08/05/2016 05:04:50.480] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Resolving login.dev.example.com before connecting
[DEBUG] [08/05/2016 05:04:50.480] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Attempting connection to [login.dev.example.com/52.14.30.100:443]
[DEBUG] [08/05/2016 05:04:50.481] [ClusterSystem-akka.actor.default-dispatcher-6] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Connection established to [login.dev.example.com:443]
[DEBUG] [08/05/2016 05:04:50.481] [ClusterSystem-akka.actor.default-dispatcher-5] [akka://ClusterSystem/system/IO-TCP] no longer watched by Actor[akka://ClusterSystem/user/StreamSupervisor-7/$$H#-1778344261]
[DEBUG] [08/05/2016 05:04:50.481] [ClusterSystem-akka.actor.default-dispatcher-5] [akka://ClusterSystem/system/IO-TCP/selectors/$a/955] now watched by Actor[akka://ClusterSystem/user/StreamSupervisor-7/$$H#-1778344261]
[DEBUG] [08/05/2016 05:04:50.483] [ClusterSystem-akka.actor.default-dispatcher-17] [akka://ClusterSystem/user/StreamSupervisor-7] now supervising Actor[akka://ClusterSystem/user/StreamSupervisor-7/flow-993-1-unknown-operation#819117275]
[DEBUG] [08/05/2016 05:04:50.484] [ClusterSystem-akka.actor.default-dispatcher-2] [akka://ClusterSystem/user/StreamSupervisor-7/flow-993-1-unknown-operation] started (akka.stream.impl.io.TLSActor@66dd942c)
[INFO] [08/05/2016 05:04:50.526] [ClusterSystem-akka.actor.default-dispatcher-17] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.0.3.103:2552] - Shutting down myself
[INFO] [08/05/2016 05:04:50.527] [ClusterSystem-akka.actor.default-dispatcher-17] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.0.3.103:2552] - Shutting down...
[DEBUG] [08/05/2016 05:04:50.528] [ClusterSystem-akka.actor.default-dispatcher-3] [akka://ClusterSystem/system/cluster/core] stopping
[DEBUG] [08/05/2016 05:04:50.528] [ClusterSystem-akka.actor.default-dispatcher-16] [akka://ClusterSystem/system/cluster/heartbeatReceiver] stopped
[DEBUG] [08/05/2016 05:04:50.539] [ClusterSystem-akka.actor.default-dispatcher-16] [akka://ClusterSystem/system/cluster/metrics] stopped
[DEBUG] [08/05/2016 05:04:50.540] [ClusterSystem-akka.actor.default-dispatcher-18] [akka://ClusterSystem/system/cluster] stopping
[INFO] [08/05/2016 05:04:50.573] [ClusterSystem-akka.actor.default-dispatcher-18] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstanceCoordinator] Self removed, stopping ClusterSingletonManager
[DEBUG] [08/05/2016 05:04:50.573] [ClusterSystem-akka.actor.default-dispatcher-18] [akka://ClusterSystem/system/sharding/ExperimentInstanceCoordinator] stopping

As you can see from the "Resolving..." and "Attempting connection" messages, an actor was happily sending off an HTTPS request to another microservice, but just after this point the cluster node announced "Shutting down myself". This killed the ClusterSingletonManager, and from that point on every incoming akka-http request that was forwarded to the shard region timed out for the same reason, so the service answered all subsequent requests to this node with HTTP status code 500. Note that the node itself was NOT down -- there are tons of messages after this point in the log. There is no code in our software that shuts down any cluster member. We were firing a lot of requests at the service when this happened, but average CPU utilization across the 10-node cluster never exceeded 30%. Why would the node decide to shut itself down?
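One thing we are double-checking is our downing configuration, since (as far as I understand it) a cluster member shuts itself down with exactly this "Shutting down myself" message when it sees its own member marked Down or Removed -- for example if automatic downing is enabled and the node was briefly considered unreachable. An illustrative application.conf fragment, with a hypothetical value, not our confirmed settings:

```hocon
akka.cluster {
  # If set, any member marked unreachable is automatically downed after
  # this interval; a downed member then terminates itself, logging
  # "Shutting down myself". Generally discouraged in production.
  auto-down-unreachable-after = 10s
}
```

Is a transient unreachability plus a setting like this a plausible explanation, or is there another mechanism that would make a node down itself?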
Any help would be appreciated. I'm concerned about the robustness of Akka in production, since we hope to scale well beyond 10 nodes.

-- Eric