I agree with Stephan. Once a TaskManager is quarantined, its ActorSystem has
to be restarted before it can reconnect to the JobManager.

Adding our own heartbeat mechanism should be fairly easy. However, we would
lose the more elaborate phi accrual failure detector which can react to
changing network conditions. But I’m not sure how much difference this
makes in reality.
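
To make the trade-off concrete, here is a rough sketch (not Akka's or Flink's code) that contrasts a fixed-timeout check with a simplified phi estimate. It assumes exponentially distributed heartbeat intervals, which gives phi = elapsed / (mean * ln 10); Akka's detector models the intervals differently and also takes their variance into account. The class and method names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch; not Akka's or Flink's implementation. */
class HeartbeatMonitor {
    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private final int windowSize;
    private long lastHeartbeatMillis = -1;
    private long intervalSum = 0;

    HeartbeatMonitor(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Records a heartbeat that arrived at the given timestamp. */
    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            long interval = nowMillis - lastHeartbeatMillis;
            intervalsMillis.addLast(interval);
            intervalSum += interval;
            // Keep only the most recent intervals so the mean follows the network.
            if (intervalsMillis.size() > windowSize) {
                intervalSum -= intervalsMillis.removeFirst();
            }
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** Fixed timeout: suspect the peer after a constant period of silence. */
    boolean isSuspectFixed(long nowMillis, long timeoutMillis) {
        return lastHeartbeatMillis >= 0
                && nowMillis - lastHeartbeatMillis > timeoutMillis;
    }

    /**
     * Simplified phi: -log10 of the probability that a heartbeat is still
     * this late, ASSUMING exponentially distributed intervals.
     * Returns 0 until at least two heartbeats have been seen.
     */
    double phi(long nowMillis) {
        if (intervalsMillis.isEmpty()) {
            return 0.0;
        }
        // Floor the mean at 1 ms to avoid dividing by zero.
        double mean = Math.max(1.0, (double) intervalSum / intervalsMillis.size());
        double elapsed = nowMillis - lastHeartbeatMillis;
        return elapsed / (mean * Math.log(10.0));
    }

    public static void main(String[] args) {
        HeartbeatMonitor fast = new HeartbeatMonitor(100);
        for (long t = 0; t <= 1000; t += 100) {
            fast.heartbeat(t);
        }
        HeartbeatMonitor slow = new HeartbeatMonitor(100);
        for (long t = 0; t <= 10000; t += 1000) {
            slow.heartbeat(t);
        }
        // The same 500 ms of silence is far more suspicious on the fast link.
        System.out.println("fast link phi: " + fast.phi(1500));
        System.out.println("slow link phi: " + slow.phi(10500));
    }
}
```

With a fixed timeout, the same 500 ms of silence is judged identically on both links, while the phi value scales with the observed heartbeat rate. That adaptivity is what we would give up.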

Listening to the QuarantinedEvent and shutting down the TM should be easy
to do. If I remember correctly, we always wanted to add a restart loop to
the task manager start script in case the TM exits with an error.
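
The restart loop itself is simple. Below is a sketch of the logic in Java rather than in the actual start script, assuming the TM process exits with a non-zero code after it detects the quarantine (that part is not shown). `RestartLoop`, `runUntilCleanExit` and the restart cap are hypothetical names and choices, not existing Flink code.

```java
import java.io.IOException;
import java.util.function.IntSupplier;

/** Hypothetical sketch of a restart loop; not part of Flink's start scripts. */
class RestartLoop {

    /**
     * Runs the launcher until it returns exit code 0 or the restart budget
     * is used up. Returns the exit code of the last run.
     */
    static int runUntilCleanExit(IntSupplier launcher, int maxRestarts) {
        int exitCode = launcher.getAsInt();
        int restarts = 0;
        while (exitCode != 0 && restarts < maxRestarts) {
            restarts++;
            System.err.println("Process exited with code " + exitCode
                    + ", restart " + restarts + " of " + maxRestarts);
            exitCode = launcher.getAsInt();
        }
        return exitCode;
    }

    /** Launcher that starts an external command and waits for it to exit. */
    static IntSupplier command(String... cmd) {
        return () -> {
            try {
                Process process = new ProcessBuilder(cmd).inheritIO().start();
                return process.waitFor();
            } catch (IOException e) {
                return 127; // the command could not be started
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 130; // interrupted while waiting
            }
        };
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: RestartLoop <command> [args...]");
            System.exit(2);
        }
        // The cap of 5 restarts is an arbitrary value for this sketch.
        System.exit(runUntilCleanExit(command(args), 5));
    }
}
```

Capping the number of restarts keeps a TM that fails immediately on every start (e.g. because of a bad configuration) from spinning forever.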

Maybe we should first do the latter (restart on quarantine) and then re-add
our own heartbeats.

Cheers,
Till

On Fri, Feb 5, 2016 at 9:27 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:

> I am not an expert on this, but I guess either could work. What is also
> interesting is that sometimes this doesn't happen for a day and sometimes
> it happens twice an hour, so it's probably some network issue as well.
>
> Gyula
>
>
> On Thu, Feb 4, 2016 at 4:32 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> We should probably add to the TaskManager a "restart on quarantined"
>> strategy anyways.
>>
>> We can detect it as follows:
>> http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state
>>
>>
>> On Thu, Feb 4, 2016 at 5:18 PM, Stephan Ewen <se...@apache.org> wrote:
>>
>>> Okay, here are the docs for the Akka version we are using:
>>> http://doc.akka.io/docs/akka/2.3.14/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model
>>>
>>> It says that after a remote deathwatch trigger, the actor system must be
>>> restarted before it can connect again.
>>>
>>> We probably need to do the following:
>>>   - Either restart TaskManager actor system when it detects that it is
>>> quarantined (or when it senses the JobManager as failed)
>>>
>>>   - Or switch from Akka death watch to a manual heartbeat mechanism
>>>
>>> Would be good to also have Till's input on this...
>>>
>>> What do you think?
>>>
>>> Stephan
>>>
>>>
>>>
>>> On Thu, Feb 4, 2016 at 5:11 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>
>>>> Yes, exactly, it says it is quarantined.
>>>>
>>>> Gyula
>>>>
>>>> On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <se...@apache.org> wrote:
>>>>
>>>>> @Gyula Do you see log messages about quarantined actor systems?
>>>>>
>>>>> There may be an issue with Akka death watch: once the connection is
>>>>> lost, it cannot be re-established unless the TaskManager is restarted.
>>>>>
>>>>> http://doc.akka.io/docs/akka/current/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <radu.tudo...@huawei.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Well… yesterday when I looked into it, there was no additional info
>>>>>> beyond what I had sent. Today I reproduced the problem and could see
>>>>>> the following in the log file.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> akka.actor.ActorInitializationException: exception during creation
>>>>>>         at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
>>>>>>         at akka.actor.ActorCell.create(ActorCell.scala:596)
>>>>>>         at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
>>>>>>         at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>>>>>>         at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
>>>>>>         at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>>>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>>>>>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
>>>>>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
>>>>>>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>>>>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>>>>>>         at akka.util.Reflect$.instantiate(Reflect.scala:66)
>>>>>>         at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)
>>>>>>         at akka.actor.Props.newActor(Props.scala:252)
>>>>>>         at akka.actor.ActorCell.newActor(ActorCell.scala:552)
>>>>>>         at akka.actor.ActorCell.create(ActorCell.scala:578)
>>>>>>         ... 10 more
>>>>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>
>>>>>> 11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task - FATAL - exception in task resource cleanup
>>>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>
>>>>>> 11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task - FATAL - exception in task exception handler
>>>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>
>>>>>> ….
>>>>>>
>>>>>> - Unexpected exception in the selector loop.
>>>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Looks like the input arrives faster than the garbage collector can
>>>>>> free memory.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dr. Radu Tudoran
>>>>>>
>>>>>> Research Engineer - Big Data Expert
>>>>>>
>>>>>> IT R&D Division
>>>>>>
>>>>>>
>>>>>>
>>>>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH
>>>>>>
>>>>>> European Research Center
>>>>>>
>>>>>> Riesstrasse 25, 80992 München
>>>>>>
>>>>>>
>>>>>>
>>>>>> E-mail: *radu.tudo...@huawei.com <radu.tudo...@huawei.com>*
>>>>>>
>>>>>> Mobile: +49 15209084330
>>>>>>
>>>>>> Telephone: +49 891588344173
>>>>>>
>>>>>>
>>>>>>
>>>>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH
>>>>>> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
>>>>>> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
>>>>>> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
>>>>>> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
>>>>>> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN
>>>>>>
>>>>>> This e-mail and its attachments contain confidential information from
>>>>>> HUAWEI, which is intended only for the person or entity whose address is
>>>>>> listed above. Any use of the information contained herein in any way
>>>>>> (including, but not limited to, total or partial disclosure, reproduction,
>>>>>> or dissemination) by persons other than the intended recipient(s) is
>>>>>> prohibited. If you receive this e-mail in error, please notify the sender
>>>>>> by phone or email immediately and delete it!
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Till Rohrmann [mailto:trohrm...@apache.org]
>>>>>> *Sent:* Thursday, February 04, 2016 4:55 PM
>>>>>> *To:* user@flink.apache.org
>>>>>> *Subject:* Re: release of task slot
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Radu,
>>>>>>
>>>>>> what does the log of the TaskManager 10.204.62.80:57910 say?
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <radu.tudo...@huawei.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am facing an error for which I cannot figure out the cause. Any
>>>>>> idea what could cause such an error?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> java.lang.Exception: The slot in which the task was executed has been
>>>>>> released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @
>>>>>> cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager
>>>>>>         at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
>>>>>>         at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>>>>>>         at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>>>>>>         at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
>>>>>>         at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
>>>>>>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)
>>>>>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>>>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>>>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>>>>         at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>>>>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>>>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>>>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>>>>         at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>>>>         at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>>>>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>>>>         at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>>>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>>>>         at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>>>>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>>>>         at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>>>>         at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>>>>>         at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>>>>>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>>>>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>>>>>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>>>>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>>>>>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dr. Radu Tudoran
>>>>>
>>>>>
>>>
>>
