Re: [akka-user] Re: Akka cluster node shutting down in the middle of processing requests
Thanks. I've added that code fragment to our application.

On Thursday, August 11, 2016 at 11:05:12 AM UTC-7, Patrik Nordwall wrote:
> I have not looked at the logs, but you'll find the answer to your last question in
> http://doc.akka.io/docs/akka/2.4/scala/cluster-usage.html#How_To_Cleanup_when_Member_is_Removed
>
> /Patrik
> (quoted message trimmed; the full text appears in the earlier messages below)

--
Read the docs: http://akka.io/docs/
Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at https://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.
Re: [akka-user] Re: Akka cluster node shutting down in the middle of processing requests
I have not looked at the logs, but you'll find the answer to your last question in http://doc.akka.io/docs/akka/2.4/scala/cluster-usage.html#How_To_Cleanup_when_Member_is_Removed

/Patrik

On Fri, 5 Aug 2016 at 22:31, Eric Swenson wrote:
> (quoted message trimmed; the full text appears in the original message below)
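The docs section linked above describes registering a callback with Cluster(system).registerOnMemberRemoved. A minimal sketch of that pattern, exiting the JVM so that a scheduler such as ECS can start a replacement node (the 10-second grace period and the object/method names are my own assumptions, not from the docs):

```scala
import akka.actor.ActorSystem
import akka.cluster.Cluster

import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Try

object RemovedCleanup {
  // Register once, at startup, on the ActorSystem that joined the cluster.
  def install(system: ActorSystem): Unit =
    Cluster(system).registerOnMemberRemoved {
      // The callback runs on an Akka dispatcher, so do the blocking wait
      // on a dedicated thread rather than blocking the dispatcher.
      new Thread {
        override def run(): Unit = {
          // Give the ActorSystem a grace period to terminate cleanly,
          // then exit non-zero so the scheduler replaces this node.
          Try(Await.ready(system.whenTerminated, 10.seconds))
          System.exit(-1)
        }
      }.start()
      system.terminate()
    }
}
```

With something like this in place, a node that downs itself (as 10.0.3.103 did here) takes its whole JVM with it, instead of lingering as a catatonic process that still accepts requests it can no longer serve.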
[akka-user] Re: Akka cluster node shutting down in the middle of processing requests
One more clue as to the cluster daemon's shutting itself down. Earlier in the logs (although prior to several successful requests being handled), I find this:

[INFO] [08/05/2016 05:04:45.042] [ClusterSystem-akka.actor.default-dispatcher-5] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.0.3.103:2552] - Leader can currently not perform its duties, reachability status: [akka.tcp://ClusterSystem@10.0.3.103:2552 -> akka.tcp://ClusterSystem@10.0.3.102:2552: Unreachable [Unreachable] (16), akka.tcp://ClusterSystem@10.0.3.103:2552 -> akka.tcp://ClusterSystem@10.0.3.104:2552: Unreachable [Unreachable] (17), akka.tcp://ClusterSystem@10.0.3.103:2552 -> akka.tcp://ClusterSystem@10.0.3.176:2552: Unreachable [Unreachable] (18), akka.tcp://ClusterSystem@10.0.3.103:2552 -> akka.tcp://ClusterSystem@10.0.3.240:2552: Unreachable [Unreachable] (19)], member status: [akka.tcp://ClusterSystem@10.0.3.102:2552 Up seen=false, akka.tcp://ClusterSystem@10.0.3.103:2552 Up seen=true, akka.tcp://ClusterSystem@10.0.3.104:2552 Up seen=false, akka.tcp://ClusterSystem@10.0.3.176:2552 Up seen=false, akka.tcp://ClusterSystem@10.0.3.240:2552 Up seen=false]

All these log messages are from the node at IP address 10.0.3.103, so I'm assuming the Leader is THIS node. It seems to be saying that it cannot reach any of the other cluster members, and because of that, it cannot do its job. This probably accounts for why it decided to shut itself down.

There were 6 AWS EC2 instances running this application at the time (not 10, as I said in an earlier message). However, the cluster membership above only shows 5 members at the time of this log message. Not sure what happened to the other one.
[akka.tcp://ClusterSystem@10.0.3.102:2552 Up seen=false,
akka.tcp://ClusterSystem@10.0.3.103:2552 Up seen=true,
akka.tcp://ClusterSystem@10.0.3.104:2552 Up seen=false,
akka.tcp://ClusterSystem@10.0.3.176:2552 Up seen=false,
akka.tcp://ClusterSystem@10.0.3.240:2552 Up seen=false]

I'm going to assume, not having any other evidence, that AWS/EC2 experienced some network issue at the time in question; consequently this node was not able to talk to the rest of the cluster, and therefore this member (the leader) shut down. I only have logs for one of the other 5 cluster nodes, so I will check to see what that other node thought about all this at the time. But I'm not very comfortable with the robustness of Akka here. I would have thought that the other cluster members could have, perhaps, noticed that the Leader was unreachable (assuming they couldn't reach it), and, because I had auto-down-unreachable-after set (yes, yes, I've since replaced this with manual downing logic -- but that is on our dev deployment and this issue happened on our staging deployment), elected a new leader and carried on -- even if this node became catatonic.

This raises another point: when the ClusterDaemon shuts itself down, it would appear that I should handle some event (not sure how to do that) to cause the entire JVM to terminate. This would cause AWS/ECS to launch a new instance to join the remaining cluster.

Thoughts? -- Eric
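On the downing point: the usual manual alternative to auto-down-unreachable-after (which, during a partition, can let each side down the other) is to subscribe to cluster events and apply any downing decision in one designated place. A sketch of such a subscriber; the actor name and the decision policy here are my own assumptions:

```scala
import akka.actor.{Actor, ActorLogging, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberEvent, UnreachableMember}

// Hypothetical listener: surfaces unreachable members so a single,
// designated decider (or an operator) can down them deliberately,
// instead of every node auto-downing on its own during a partition.
class UnreachableListener extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[UnreachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case UnreachableMember(member) =>
      log.warning("Member detected as unreachable: {}", member)
      // cluster.down(member.address) // only from the single designated decider
    case _: MemberEvent => // ignore other membership events here
  }
}

object UnreachableListener {
  def props: Props = Props[UnreachableListener]
}
```

Started with system.actorOf(UnreachableListener.props, "unreachable-listener"), this at least makes the partition visible; whether to down automatically from here is exactly the judgment call auto-downing gets wrong.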
[akka-user] Re: Akka cluster node shutting down in the middle of processing requests
Also, what does this message mean? I saw it earlier on in the logs:

[DEBUG] [08/05/2016 05:04:50.450] [ClusterSystem-akka.actor.default-dispatcher-17] [akka://ClusterSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.3.176%3A2552-6] unhandled message from Actor[akka://ClusterSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.3.176%3A2552-6#1200432312]: Ungate

On Friday, August 5, 2016 at 12:58:55 PM UTC-7, Eric Swenson wrote:
> Our akka-cluster-sharding service went down last night. In the middle of processing akka-http requests (and sending these requests to a sharding region for processing) on a 10-node cluster, one of the requests got an "ask timeout" exception:
>
> [ERROR] [08/05/2016 05:04:51.077] [ClusterSystem-akka.actor.default-dispatcher-16] [akka.actor.ActorSystemImpl(ClusterSystem)] Error during processing of request HttpRequest(HttpMethod(GET), http://eim.staging.example.com/eim/check/a0afbad4-69a8-4487-a6fb-f3e884a8d0aa?cache=false=15,List(Host: eim.staging.example.com, X-Real-Ip: 10.0.3.9, X-Forwarded-For: 10.0.3.9, Connection: upgrade, Accept: */*, Accept-Encoding: gzip, deflate, compress, Authorization: Bearer aaa-1ZdrFpgR5AyOGa69Q2s3fwv_y5zz9UCL5F85Hc, User-Agent: python-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-74-generic, Timeout-Access: ),HttpEntity.Strict(application/json,),HttpProtocol(HTTP/1.1))
>
> akka.pattern.AskTimeoutException: Recipient[Actor[akka://ClusterSystem/system/sharding/ExperimentInstance#-1675878517]] had already been terminated. Sender[null] sent the message of type "com.genecloud.eim.ExperimentInstance$Commands$CheckExperiment".
>
> As the error message says, the reason for the ask timeout was that the actor (sharding region?) had been terminated.
>
> Looking back in the logs, I see that everything was going well for quite some time, until the following:
>
> [DEBUG] [08/05/2016 05:04:50.480] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Resolving login.dev.example.com before connecting
> [DEBUG] [08/05/2016 05:04:50.480] [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Attempting connection to [login.dev.example.com/52.14.30.100:443]
> [DEBUG] [08/05/2016 05:04:50.481] [ClusterSystem-akka.actor.default-dispatcher-6] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Connection established to [login.dev.example.com:443]
> [DEBUG] [08/05/2016 05:04:50.481] [ClusterSystem-akka.actor.default-dispatcher-5] [akka://ClusterSystem/system/IO-TCP] no longer watched by Actor[akka://ClusterSystem/user/StreamSupervisor-7/$$H#-1778344261]
> [DEBUG] [08/05/2016 05:04:50.481] [ClusterSystem-akka.actor.default-dispatcher-5] [akka://ClusterSystem/system/IO-TCP/selectors/$a/955] now watched by Actor[akka://ClusterSystem/user/StreamSupervisor-7/$$H#-1778344261]
> [DEBUG] [08/05/2016 05:04:50.483] [ClusterSystem-akka.actor.default-dispatcher-17] [akka://ClusterSystem/user/StreamSupervisor-7] now supervising Actor[akka://ClusterSystem/user/StreamSupervisor-7/flow-993-1-unknown-operation#819117275]
> [DEBUG] [08/05/2016 05:04:50.484] [ClusterSystem-akka.actor.default-dispatcher-2] [akka://ClusterSystem/user/StreamSupervisor-7/flow-993-1-unknown-operation] started (akka.stream.impl.io.TLSActor@66dd942c)
> [INFO] [08/05/2016 05:04:50.526] [ClusterSystem-akka.actor.default-dispatcher-17] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.0.3.103:2552] - Shutting down myself
> [INFO] [08/05/2016 05:04:50.527] [ClusterSystem-akka.actor.default-dispatcher-17] [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@10.0.3.103:2552] - Shutting down...
> [DEBUG] [08/05/2016 05:04:50.528] [ClusterSystem-akka.actor.default-dispatcher-3] [akka://ClusterSystem/system/cluster/core] stopping
> [DEBUG] [08/05/2016 05:04:50.528] [ClusterSystem-akka.actor.default-dispatcher-16] [akka://ClusterSystem/system/cluster/heartbeatReceiver] stopped
> [DEBUG] [08/05/2016 05:04:50.539] [ClusterSystem-akka.actor.default-dispatcher-16] [akka://ClusterSystem/system/cluster/metrics] stopped
> [DEBUG] [08/05/2016 05:04:50.540] [ClusterSystem-akka.actor.default-dispatcher-18] [akka://ClusterSystem/system/cluster] stopping
> [INFO] [08/05/2016 05:04:50.573] [ClusterSystem-akka.actor.default-dispatcher-18] [akka.tcp://ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstanceCoordinator] Self removed\
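One mitigation at the HTTP layer, separate from fixing the shutdown itself: recover the ask failure so that a terminated shard region surfaces as a controlled error response rather than an unhandled exception in the request pipeline. A sketch with a hypothetical command and response type standing in for the real ExperimentInstance$Commands$CheckExperiment and its reply:

```scala
import akka.actor.ActorRef
import akka.pattern.{ask, AskTimeoutException}
import akka.util.Timeout

import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical wrapper around asking the shard region. When the region
// actor has been terminated (as in the AskTimeoutException in the log
// above), the ask fails; recovering here lets the route translate the
// failure into a 503 Service Unavailable instead of a 500.
object ShardAsk {
  def check(region: ActorRef, cmd: Any)
           (implicit ec: ExecutionContext): Future[Either[String, String]] = {
    implicit val timeout: Timeout = 5.seconds
    (region ? cmd).mapTo[String]
      .map(Right(_))
      .recover { case _: AskTimeoutException =>
        Left("shard region unavailable") // map to 503 at the route level
      }
  }
}
```

This doesn't stop the node from shutting down, but it keeps clients getting a meaningful "try again" signal while ECS brings a replacement node up.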