Re: [akka-user] Re: Akka cluster node shutting down in the middle of processing requests

2016-08-12 Thread Eric Swenson
Thanks. I've added that code fragment to our application.  

On Thursday, August 11, 2016 at 11:05:12 AM UTC-7, Patrik Nordwall wrote:
>
> I have not looked at the logs, but you can find the answer to your last 
> question at 
> http://doc.akka.io/docs/akka/2.4/scala/cluster-usage.html#How_To_Cleanup_when_Member_is_Removed
>
> /Patrik
>

-- 
>>  Read the docs: http://akka.io/docs/
>>  Check the FAQ: 
>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>  Search the archives: https://groups.google.com/group/akka-user
--- 
You received this message because you are subscribed to the Google Groups "Akka 
User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to akka-user+unsubscr...@googlegroups.com.
To post to this group, send 

Re: [akka-user] Re: Akka cluster node shutting down in the middle of processing requests

2016-08-11 Thread Patrik Nordwall
I have not looked at the logs, but you can find the answer to your last
question at
http://doc.akka.io/docs/akka/2.4/scala/cluster-usage.html#How_To_Cleanup_when_Member_is_Removed

/Patrik
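
For readers without the linked page at hand: that section registers a callback with Cluster(system).registerOnMemberRemoved, which runs once this node has been removed from the cluster. Below is a minimal sketch of that approach, assuming Akka 2.4 with akka-cluster on the classpath. The object name ExitOnRemoval, the 10-second grace period, and the exit codes are illustrative choices, not something stated in this thread.

```scala
import akka.actor.ActorSystem
import akka.cluster.Cluster

import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical helper: terminate the JVM once this node is removed from the cluster.
object ExitOnRemoval {
  def install(system: ActorSystem): Unit =
    Cluster(system).registerOnMemberRemoved {
      // Exit the JVM as soon as the ActorSystem has terminated.
      system.registerOnTermination(System.exit(0))
      // Begin shutting down the ActorSystem.
      system.terminate()

      // If shutdown takes longer than the (assumed) 10-second grace period,
      // force the JVM to exit. This runs in a separate thread so that the
      // wait does not block the shutdown itself.
      new Thread {
        override def run(): Unit =
          if (Try(Await.ready(system.whenTerminated, 10.seconds)).isFailure)
            System.exit(-1)
      }.start()
    }
}
```

Calling ExitOnRemoval.install(system) once, right after the ActorSystem is created, should be enough. When the JVM exits, an external supervisor such as ECS can start a replacement instance.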




[akka-user] Re: Akka cluster node shutting down in the middle of processing requests

2016-08-05 Thread Eric Swenson
One more clue as to the cluster daemon's shutting itself down.  Earlier in 
the logs (although prior to several successful requests being handled), I 
find this:

[INFO] [08/05/2016 05:04:45.042] 
[ClusterSystem-akka.actor.default-dispatcher-5] 
[akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node 
[akka.tcp://ClusterSystem@10.0.3.103:2552] - Leader can currently not 
perform its duties, reachability status: 
[akka.tcp://ClusterSystem@10.0.3.103:2552 -> 
akka.tcp://ClusterSystem@10.0.3.102:2552: Unreachable [Unreachable] (16), 
akka.tcp://ClusterSystem@10.0.3.103:2552 -> 
akka.tcp://ClusterSystem@10.0.3.104:2552: Unreachable [Unreachable] (17), 
akka.tcp://ClusterSystem@10.0.3.103:2552 -> 
akka.tcp://ClusterSystem@10.0.3.176:2552: Unreachable [Unreachable] (18), 
akka.tcp://ClusterSystem@10.0.3.103:2552 -> 
akka.tcp://ClusterSystem@10.0.3.240:2552: Unreachable [Unreachable] (19)], 
member status: [akka.tcp://ClusterSystem@10.0.3.102:2552 Up seen=false, 
akka.tcp://ClusterSystem@10.0.3.103:2552 Up seen=true, 
akka.tcp://ClusterSystem@10.0.3.104:2552 Up seen=false, 
akka.tcp://ClusterSystem@10.0.3.176:2552 Up seen=false, 
akka.tcp://ClusterSystem@10.0.3.240:2552 Up seen=false]


All these log messages are from the node at IP address 10.0.3.103.  So I'm 
assuming this means the Leader is THIS node.  It seems to be saying that it 
cannot reach all the other cluster members, and because of that, it cannot 
do its job. This probably accounts for why it decided to shut itself down.  


There were 6 AWS EC2 instances running this application at the time (not 
10, as I said in an earlier message).  However, the cluster membership 
above shows only 5 members at the time of this log message.  Not sure what 
happened to the other one.  


[akka.tcp://ClusterSystem@10.0.3.102:2552 Up seen=false,
 akka.tcp://ClusterSystem@10.0.3.103:2552 Up seen=true,
 akka.tcp://ClusterSystem@10.0.3.104:2552 Up seen=false,
 akka.tcp://ClusterSystem@10.0.3.176:2552 Up seen=false,
 akka.tcp://ClusterSystem@10.0.3.240:2552 Up seen=false]
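
The reachability and membership details in the log line above can be tallied mechanically. The following is a rough sketch in plain Scala (no Akka dependency). It assumes the log message has first been joined back into a single line, since the mail client wrapped it. The object and method names are made up for illustration.

```scala
// Hypothetical helper for reading a "Leader can currently not perform its duties" log line.
object ReachabilityLog {
  // Matches "<observer> -> <subject>: Unreachable" records.
  private val Record =
    """(akka\.tcp://\S+?)\s*->\s*(akka\.tcp://\S+?):\s*Unreachable""".r

  // Matches "<address> <status> seen=<true|false>" entries in the member status list.
  private val Member =
    """(akka\.tcp://\S+)\s+(\w+)\s+seen=(true|false)""".r

  /** (observer, subject) pairs that the log line reports as Unreachable. */
  def unreachable(line: String): List[(String, String)] =
    Record.findAllMatchIn(line).map(m => (m.group(1), m.group(2))).toList

  /** Addresses of members listed with seen=false. */
  def unseen(line: String): List[String] =
    Member.findAllMatchIn(line)
      .collect { case m if m.group(3) == "false" => m.group(1) }
      .toList
}
```

Applied to the log line in this message, every Unreachable record has 10.0.3.103 as the observer, and the four other members are the ones listed with seen=false. That matches the reading that this node had lost contact with everyone else.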


I'm going to assume, not having any other evidence, that AWS/EC2 
experienced some network issue at the time in question, and consequently 
this node was not able to talk to the rest of the cluster and therefore 
this member (the leader) shut down.  I only have logs for one of the other 
5 cluster nodes, so I will check to see what that other node thought about 
all this at the time.  But I'm not very comfortable with the robustness of 
akka here.  I would have thought that the other cluster members, noticing 
that the Leader was unreachable (assuming they couldn't reach it), and 
because I had auto-down-unreachable-after set (yes, yes, I've since 
replaced this with manual downing logic -- but that is on our dev 
deployment and this issue happened on our staging deployment), could have 
elected a new leader and carried on -- even if this node became catatonic.  


This raises another point:  When the ClusterDaemon shuts itself down, it 
would appear that I should handle some event here (not sure how to do 
that), to cause the entire JVM to terminate.  This would cause AWS/ECS to 
launch a new instance to join the remaining cluster.


Thoughts?  -- Eric







[akka-user] Re: Akka cluster node shutting down in the middle of processing requests

2016-08-05 Thread Eric Swenson
Also, what does this message mean?  I saw it earlier on in the logs:

[DEBUG] [08/05/2016 05:04:50.450] 
[ClusterSystem-akka.actor.default-dispatcher-17] 
[akka://ClusterSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.3.176%3A2552-6] 
unhandled message from 
Actor[akka://ClusterSystem/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%4010.0.3.176%3A2552-6#1200432312]: 
Ungate

On Friday, August 5, 2016 at 12:58:55 PM UTC-7, Eric Swenson wrote:
>
> Our akka-cluster-sharding service went down last night. In the middle of 
> processing akka-http requests (and sending these requests to a sharding 
> region for processing) on a 10-node cluster, one of the requests got an 
> "ask timeout" exception:
>
> [ERROR] [08/05/2016 05:04:51.077] 
> [ClusterSystem-akka.actor.default-dispatcher-16] 
> [akka.actor.ActorSystemImpl(ClusterSystem)] Error during processing of 
> request HttpRequest(HttpMethod(GET),
> http://eim.staging.example.com/eim/check/a0afbad4-69a8-4487-a6fb-f3e884a8d0aa?cache=false=15,List(Host:
> eim.staging.example.com, X-Real-Ip: 10.0.3.9, X-Forwarded-For: 10.0.3.9, 
> Connection: upgrade, Accept: */*, Accept-Encoding: gzip, deflate, compress, 
> Authorization: Bearer aaa-1ZdrFpgR5AyOGa69Q2s3fwv_y5zz9UCL5F85Hc, 
> User-Agent: python-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-74-generic, 
> Timeout-Access: ),HttpEntity.Strict(application/json,),HttpProtocol(HTTP/1.1))
>
> akka.pattern.AskTimeoutException: 
> Recipient[Actor[akka://ClusterSystem/system/sharding/ExperimentInstance#-1675878517]]
> had already been terminated. Sender[null] sent the message of type 
> "com.genecloud.eim.ExperimentInstance$Commands$CheckExperiment".
>
> As the error message says, the ask timed out because the actor (sharding 
> region?) had been terminated.  
>
> Looking back in the logs, I see that everything was going well for quite 
> some time, until the following:
>
>
> [DEBUG] [08/05/2016 05:04:50.480] 
> [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://
> ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Resolving 
> login.dev.example.com before connecting
>
> [DEBUG] [08/05/2016 05:04:50.480] 
> [ClusterSystem-akka.actor.default-dispatcher-2] [akka.tcp://
> ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Attempting 
> connection to [login.dev.example.com/52.14.30.100:443]
>
> [DEBUG] [08/05/2016 05:04:50.481] 
> [ClusterSystem-akka.actor.default-dispatcher-6] [akka.tcp://
> ClusterSystem@10.0.3.103:2552/system/IO-TCP/selectors/$a/955] Connection 
> established to [login.dev.example.com:443]
>
> [DEBUG] [08/05/2016 05:04:50.481] 
> [ClusterSystem-akka.actor.default-dispatcher-5] 
> [akka://ClusterSystem/system/IO-TCP] no longer watched by 
> Actor[akka://ClusterSystem/user/StreamSupervisor-7/$$H#-1778344261]
>
> [DEBUG] [08/05/2016 05:04:50.481] 
> [ClusterSystem-akka.actor.default-dispatcher-5] 
> [akka://ClusterSystem/system/IO-TCP/selectors/$a/955] now watched by 
> Actor[akka://ClusterSystem/user/StreamSupervisor-7/$$H#-1778344261]
>
> [DEBUG] [08/05/2016 05:04:50.483] 
> [ClusterSystem-akka.actor.default-dispatcher-17] 
> [akka://ClusterSystem/user/StreamSupervisor-7] now supervising 
> Actor[akka://ClusterSystem/user/StreamSupervisor-7/flow-993-1-unknown-operation#819117275]
>
> [DEBUG] [08/05/2016 05:04:50.484] 
> [ClusterSystem-akka.actor.default-dispatcher-2] 
> [akka://ClusterSystem/user/StreamSupervisor-7/flow-993-1-unknown-operation] 
> started (akka.stream.impl.io.TLSActor@66dd942c)
>
> [INFO] [08/05/2016 05:04:50.526] 
> [ClusterSystem-akka.actor.default-dispatcher-17] 
> [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://
> ClusterSystem@10.0.3.103:2552] - Shutting down myself
>
> [INFO] [08/05/2016 05:04:50.527] 
> [ClusterSystem-akka.actor.default-dispatcher-17] 
> [akka.cluster.Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://
> ClusterSystem@10.0.3.103:2552] - Shutting down...
>
> [DEBUG] [08/05/2016 05:04:50.528] 
> [ClusterSystem-akka.actor.default-dispatcher-3] 
> [akka://ClusterSystem/system/cluster/core] stopping
>
> [DEBUG] [08/05/2016 05:04:50.528] 
> [ClusterSystem-akka.actor.default-dispatcher-16] 
> [akka://ClusterSystem/system/cluster/heartbeatReceiver] stopped
>
> [DEBUG] [08/05/2016 05:04:50.539] 
> [ClusterSystem-akka.actor.default-dispatcher-16] 
> [akka://ClusterSystem/system/cluster/metrics] stopped
>
> [DEBUG] [08/05/2016 05:04:50.540] 
> [ClusterSystem-akka.actor.default-dispatcher-18] 
> [akka://ClusterSystem/system/cluster] stopping
>
> [INFO] [08/05/2016 05:04:50.573] 
> [ClusterSystem-akka.actor.default-dispatcher-18] [akka.tcp://
> ClusterSystem@10.0.3.103:2552/system/sharding/ExperimentInstanceCoordinator] 
> Self removed\
>
>