Re: [akka-user] Recovering from the Quarantined state

2015-02-06 Thread Patrik Nordwall
You should probably also look into why they are quarantined.

There can be two reasons:

1) The nodes are removed from the cluster. This happens when the failure
detector marks them unreachable, you use auto-downing, and they don't
become reachable again within the configured
akka.cluster.auto-down-unreachable-after timeout. You might want to
increase the auto-down timeout.

2) The system message delivery buffer overflows, because of many remote
watches or remote deployments. You might want to increase
akka.remote.system-message-buffer-size, or adjust your design.
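
As a rough sketch, the two settings named above could be raised in
application.conf along these lines. Only the setting names come from this
message; the values are placeholders chosen for illustration, not defaults
or recommendations, and would need tuning for a real cluster.

```
akka {
  cluster {
    # Placeholder value (assumption): give unreachable nodes longer to
    # come back before they are automatically downed.
    auto-down-unreachable-after = 120s
  }
  remote {
    # Placeholder value (assumption): allow more buffered system messages
    # (remote watch, remote deployment) per remote association.
    system-message-buffer-size = 5000
  }
}
```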

Cheers,
Patrik

Re: [akka-user] Recovering from the Quarantined state

2015-02-06 Thread Akka Team
Hi Mark,

On Tue, Feb 3, 2015 at 5:13 PM, Mark Kegel  wrote:

> We are using akka 2.3.4, but I don't think this is an issue with a
> specific version of akka. In fact the docs explicitly state that you have
> to restart the akka node after it's been Quarantined.
>
> I'm looking for some way to detect that my node has been quarantined so
> that I can force an exit, so that our puppet system can restart it, or just
> restart the akka system programmatically without exiting the process. This
> seems like basic error handling and recovery but I see nothing in the docs
> on how a person is supposed to handle this, or how they can even be
> notified of the issue.
>

I agree that we can improve the documentation around this. Remoting
publishes events that you can subscribe to:

http://doc.akka.io/docs/akka/2.3.9/scala/remoting.html#Remote_Events

One of those events, QuarantinedEvent, reports a quarantine:
http://doc.akka.io/api/akka/2.3.9/#akka.remote.QuarantinedEvent

-Endre
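
A minimal sketch of such a subscription, assuming akka-actor and akka-remote
2.3.x on the classpath. The QuarantineListener class and its onQuarantine
callback are hypothetical names introduced for illustration; only the event
stream subscription and QuarantinedEvent come from Akka. As far as the API
docs describe it, the event's address names the remote system that the local
system has quarantined, so the listener reports quarantines decided on the
node where it runs. What to do in the callback (log, alert, exit the JVM so
an external supervisor restarts it) is left open here.

```scala
import akka.actor.{Actor, ActorLogging, Address, Props}
import akka.remote.QuarantinedEvent

// Hypothetical listener actor; the name and the callback are assumptions
// made for this sketch, not part of Akka.
class QuarantineListener(onQuarantine: Address => Unit)
    extends Actor with ActorLogging {

  // Subscribe to quarantine notifications on the system event stream.
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[QuarantinedEvent])

  // Remove all event stream subscriptions held by this actor.
  override def postStop(): Unit =
    context.system.eventStream.unsubscribe(self)

  def receive = {
    case event: QuarantinedEvent =>
      log.error("Remote system {} has been quarantined", event.address)
      // Application-specific reaction, supplied by the caller.
      onQuarantine(event.address)
  }
}

object QuarantineListener {
  // Convenience factory for creating the listener.
  def props(onQuarantine: Address => Unit): Props =
    Props(new QuarantineListener(onQuarantine))
}
```

It could be started once per actor system, for example with
system.actorOf(QuarantineListener.props(addr => println(addr)), "quarantine-listener"),
where the println stands in for whatever restart policy fits the deployment.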


-- 
Akka Team
Typesafe - The software stack for applications that scale
Blog: letitcrash.com
Twitter: @akkateam


Re: [akka-user] Recovering from the Quarantined state

2015-02-03 Thread Mark Kegel
We are using akka 2.3.4, but I don't think this is an issue with a specific 
version of akka. In fact the docs explicitly state that you have to restart 
the akka node after it's been Quarantined.

I'm looking for some way to detect that my node has been quarantined so 
that I can force an exit, so that our puppet system can restart it, or just 
restart the akka system programmatically without exiting the process. This 
seems like basic error handling and recovery but I see nothing in the docs 
on how a person is supposed to handle this, or how they can even be 
notified of the issue.

Is there any kind of exception that bubbles back to user code, or a cluster 
state message that I can receive, for when my local akka instance can't 
rejoin the cluster?

Is there any way a supervisor hierarchy can help solve this problem?

If someone can point me to code that is able to respond and recover from 
such failures intelligently, and using akka approved idioms, that would be 
most appreciated.

Mark



-- 
>>  Read the docs: http://akka.io/docs/
>>  Check the FAQ: 
>> http://doc.akka.io/docs/akka/current/additional/faq.html
>>  Search the archives: https://groups.google.com/group/akka-user
--- 
You received this message because you are subscribed to the Google Groups "Akka 
User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to akka-user+unsubscr...@googlegroups.com.
To post to this group, send email to akka-user@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.


Re: [akka-user] Recovering from the Quarantined state

2015-02-03 Thread Patrik Nordwall
What version of Akka are you using? We fixed some issues related to
quarantining in 2.3.9.
/Patrik

-- 

Patrik Nordwall
Typesafe  -  Reactive apps on the JVM
Twitter: @patriknw


[akka-user] Recovering from the Quarantined state

2015-01-26 Thread Mark Kegel
We are using akka in a clustered configuration at work. It's a very simple
cluster with just three node types: an admin node, "live" nodes, and 
"preview" nodes. The admin node will manage nodes of the other two types, 
and ask for things like status and uptime. Every so often one of the 
live/preview nodes will become unresponsive to requests from the admin 
node. The only way we've been able to fix this is to restart the node.

From reading the akka docs this seems to correspond to the node becoming
Quarantined. While I appreciate that this state is necessary to maintain 
consistency, I'm at a loss in finding docs that show how to respond in code 
when this happens. On our admin node we'll know that some other 
live/preview node has failed and will require a restart, but what would 
work best is if we could have a service watching locally on the failed 
live/preview node that could force a restart of that node's JVM.

Is there any kind of exception that bubbles back to user code, or a cluster 
state message that I can receive, for when my local akka instance can't 
rejoin the cluster?

Is there any way a supervisor hierarchy can help solve this problem?

If someone can point me to code that is able to respond and recover from 
such failures intelligently, and using akka approved idioms, that would be 
most appreciated.

Mark
