Also, what does this message mean? I saw it earlier in the logs:

[DEBUG] [08/05/2016 05:04:50.450]
[ClusterSystem-akka.actor.default-dispatcher-17]
[akka://ClusterSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.3.176%3A2552-6]
unhandled message from Actor[akka://ClusterSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.3.176%3A2552-6#1200432312]: Ungate
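
In case it helps: we already run at loglevel DEBUG. If more detail around that
gate/ungate would be useful, I can also turn on unhandled-message and remote
lifecycle-event logging when creating the ActorSystem. Roughly like this
(standard Akka settings; the override-and-fallback wiring is just a sketch):

  import akka.actor.ActorSystem
  import com.typesafe.config.ConfigFactory

  object MoreLogging {
    // Extra diagnostics layered on top of our normal application.conf.
    val extra = ConfigFactory.parseString(
      """
      akka.loglevel = DEBUG
      akka.actor.debug.unhandled = on
      akka.remote.log-remote-lifecycle-events = on
      """)

    def start(): ActorSystem =
      ActorSystem("ClusterSystem", extra.withFallback(ConfigFactory.load()))
  }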

On Friday, August 5, 2016 at 12:58:55 PM UTC-7, Eric Swenson wrote:
>
> Our akka-cluster-sharding service went down last night. In the middle of 
> processing akka-http requests on a 10-node cluster (each request is forwarded 
> to a shard region for processing), one of the requests failed with an "ask 
> timeout" exception:
>
> [ERROR] [08/05/2016 05:04:51.077]
> [ClusterSystem-akka.actor.default-dispatcher-16]
> [akka.actor.ActorSystemImpl(ClusterSystem)] Error during processing of request
> HttpRequest(HttpMethod(GET),
> http://eim.staging.example.com/eim/check/a0afbad4-69a8-4487-a6fb-f3e884a8d0aa?cache=false&timeout=15,
> List(Host: eim.staging.example.com, X-Real-Ip: 10.0.3.9, X-Forwarded-For: 10.0.3.9,
> Connection: upgrade, Accept: */*, Accept-Encoding: gzip, deflate, compress,
> Authorization: Bearer aaa-1ZdrFpgR5AyOGa69Q2s3fwv_y5zz9UCL5F85Hc,
> User-Agent: python-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-74-generic,
> Timeout-Access: <function1>),HttpEntity.Strict(application/json,),HttpProtocol(HTTP/1.1))
>
> akka.pattern.AskTimeoutException:
> Recipient[Actor[akka://ClusterSystem/system/sharding/ExperimentInstance#-1675878517]]
> had already been terminated. Sender[null] sent the message of type
> "com.genecloud.eim.ExperimentInstance$Commands$CheckExperiment".
>
> As the error message says, the ask timed out because the recipient actor (the 
> shard region?) had already been terminated.
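>
> For context, the route handler does roughly the following with the shard
> region (simplified; the command case class and the 15-second timeout are
> illustrative stand-ins for the real
> com.genecloud.eim.ExperimentInstance.Commands.CheckExperiment and our
> configured ask timeout):
>
>   import akka.actor.ActorRef
>   import akka.pattern.ask
>   import akka.util.Timeout
>   import scala.concurrent.Future
>   import scala.concurrent.duration._
>
>   // Illustrative stand-in for our real command message.
>   final case class CheckExperiment(experimentId: String)
>
>   object CheckRoute {
>     // shardRegion is the ActorRef returned by ClusterSharding(system).start(...)
>     implicit val timeout: Timeout = Timeout(15.seconds)
>
>     // Once the region actor has terminated, this ask fails immediately with
>     // AskTimeoutException ("Recipient ... had already been terminated")
>     // rather than waiting out the timeout.
>     def check(shardRegion: ActorRef, experimentId: String): Future[Any] =
>       shardRegion ? CheckExperiment(experimentId)
>   }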
>
> Looking back in the logs, I see that everything was going well for quite 
> some time, until the following:
>
>
> [DEBUG] [08/05/2016 05:04:50.480]
> [ClusterSystem-akka.actor.default-dispatcher-2]
> [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955]
> Resolving login.dev.example.com before connecting
>
> [DEBUG] [08/05/2016 05:04:50.480]
> [ClusterSystem-akka.actor.default-dispatcher-2]
> [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955]
> Attempting connection to [login.dev.example.com/52.14.30.100:443]
>
> [DEBUG] [08/05/2016 05:04:50.481]
> [ClusterSystem-akka.actor.default-dispatcher-6]
> [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955]
> Connection established to [login.dev.example.com:443]
>
> [DEBUG] [08/05/2016 05:04:50.481] 
> [ClusterSystem-akka.actor.default-dispatcher-5] 
> [akka://ClusterSystem/system/IO-TCP] no longer watched by 
> Actor[akka://ClusterSystem/user/StreamSupervisor-7/$$H#-1778344261]
>
> [DEBUG] [08/05/2016 05:04:50.481] 
> [ClusterSystem-akka.actor.default-dispatcher-5] 
> [akka://ClusterSystem/system/IO-TCP/selectors/$a/955] now watched by 
> Actor[akka://ClusterSystem/user/StreamSupervisor-7/$$H#-1778344261]
>
> [DEBUG] [08/05/2016 05:04:50.483]
> [ClusterSystem-akka.actor.default-dispatcher-17]
> [akka://ClusterSystem/user/StreamSupervisor-7] now supervising
> Actor[akka://ClusterSystem/user/StreamSupervisor-7/flow-993-1-unknown-operation#819117275]
>
> [DEBUG] [08/05/2016 05:04:50.484] 
> [ClusterSystem-akka.actor.default-dispatcher-2] 
> [akka://ClusterSystem/user/StreamSupervisor-7/flow-993-1-unknown-operation] 
> started (akka.stream.impl.io.TLSActor@66dd942c)
>
> [INFO] [08/05/2016 05:04:50.526]
> [ClusterSystem-akka.actor.default-dispatcher-17]
> [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node
> [akka.tcp://ClusterSystem@10.0.3.103:2552] - Shutting down myself
>
> [INFO] [08/05/2016 05:04:50.527]
> [ClusterSystem-akka.actor.default-dispatcher-17]
> [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node
> [akka.tcp://ClusterSystem@10.0.3.103:2552] - Shutting down...
>
> [DEBUG] [08/05/2016 05:04:50.528] 
> [ClusterSystem-akka.actor.default-dispatcher-3] 
> [akka://ClusterSystem/system/cluster/core] stopping
>
> [DEBUG] [08/05/2016 05:04:50.528] 
> [ClusterSystem-akka.actor.default-dispatcher-16] 
> [akka://ClusterSystem/system/cluster/heartbeatReceiver] stopped
>
> [DEBUG] [08/05/2016 05:04:50.539] 
> [ClusterSystem-akka.actor.default-dispatcher-16] 
> [akka://ClusterSystem/system/cluster/metrics] stopped
>
> [DEBUG] [08/05/2016 05:04:50.540] 
> [ClusterSystem-akka.actor.default-dispatcher-18] 
> [akka://ClusterSystem/system/cluster] stopping
>
> [INFO] [08/05/2016 05:04:50.573]
> [ClusterSystem-akka.actor.default-dispatcher-18]
> [akka.tcp://ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstanceCoordinator]
> Self removed, stopping ClusterSingletonManager
>
> [DEBUG] [08/05/2016 05:04:50.573] 
> [ClusterSystem-akka.actor.default-dispatcher-18] 
> [akka://ClusterSystem/system/sharding/ExperimentInstanceCoordinator] 
> stopping
>
>
> As you can see from the "Resolving..." and "Attempting connection" log 
> messages, an actor was happily sending an HTTP request to another 
> microservice over TLS, but just after that point the cluster node announced 
> "Shutting down myself". That stopped the ClusterSingletonManager for the 
> shard coordinator, and from that point on all incoming requests to the shard 
> region were rejected (because it was down).
>
>
> The node itself was NOT down; there are plenty of messages after this point 
> in the log. But every request that arrived via akka-http and was forwarded to 
> the shard region timed out for the same reason, so the service answered all 
> subsequent requests to this node with HTTP status code 500.
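>
> As an aside, we will probably map these ask timeouts to a 503 instead of the
> default 500, so clients can tell "shard region unavailable" apart from real
> application errors. A rough, untested sketch of what I have in mind:
>
>   import akka.http.scaladsl.model.StatusCodes
>   import akka.http.scaladsl.server.Directives._
>   import akka.http.scaladsl.server.{ExceptionHandler, Route}
>   import akka.pattern.AskTimeoutException
>
>   object AskTimeoutAs503 {
>     val handler = ExceptionHandler {
>       case _: AskTimeoutException =>
>         complete(StatusCodes.ServiceUnavailable -> "shard region unavailable, please retry")
>     }
>
>     // Wrap our existing routes so ask timeouts become 503 rather than 500.
>     def wrap(inner: Route): Route = handleExceptions(handler)(inner)
>   }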
>
>
> There is no code in our software that shuts down any cluster member.  
>
>
> We were firing a lot of requests at the service when this happened, but 
> average CPU utilization across the 10-node cluster running this microservice 
> never exceeded 30%.
>
>
> Why would the node decide to shut itself down? 
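>
> To get more visibility the next time this happens, I'm thinking of adding a
> small listener that logs cluster membership and reachability events, so we
> can see whether this node was marked unreachable or downed by another member
> before it logged "Shutting down myself". Rough sketch:
>
>   import akka.actor.{Actor, ActorLogging, Props}
>   import akka.cluster.Cluster
>   import akka.cluster.ClusterEvent._
>
>   class ClusterEventLogger extends Actor with ActorLogging {
>     val cluster = Cluster(context.system)
>
>     override def preStart(): Unit =
>       cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
>         classOf[MemberEvent], classOf[UnreachableMember], classOf[ReachableMember])
>
>     override def postStop(): Unit = cluster.unsubscribe(self)
>
>     def receive = {
>       case UnreachableMember(m)   => log.warning("Member unreachable: {}", m)
>       case ReachableMember(m)     => log.info("Member reachable again: {}", m)
>       case MemberRemoved(m, prev) => log.warning("Member removed: {} (was {})", m, prev)
>       case e: MemberEvent         => log.info("Member event: {}", e)
>     }
>   }
>
>   object ClusterEventLogger {
>     val props = Props[ClusterEventLogger]
>   }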
>
>
> Any help would be appreciated. I'm concerned about the robustness of Akka in 
> production, since we hope to scale well beyond 10 nodes.
>
>
> -- Eric