Re: release of task slot

Gyula Fóra Fri, 05 Feb 2016 00:28:21 -0800

I am not an expert on this but I guess either could work. What is also
interesting that sometimes this doesn't happen for a day sometimes it
happens twice an hour so it's probably some network issue as well.


Gyula


Stephan Ewen <se...@apache.org> ezt írta (időpont: 2016. febr. 4., Cs,
16:32):

> We should probably add to the TaskManager a "restart on quarantined"
> strategy anyways.
>
> We can detect it as follows:
> http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state
>
>
> On Thu, Feb 4, 2016 at 5:18 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> Okay, here are the docs for the Akka version we are using:
>> http://doc.akka.io/docs/akka/2.3.14/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model
>>
>> It says that after a remote deathwatch trigger, the actor system must be
>> restarted before it can connect again.
>>
>> We probably need to do the following:
>>   - Either restart TaskManager actor system when it detects that it is
>> quarantined (or when it senses the JobManager as failed)
>>
>>   - Or switch from Akka death watch to a manual heartbeat mechanism
>>
>> Would be good to also have Till's input on this...
>>
>> What do you think?
>>
>> Stephan
>>
>>
>>
>> On Thu, Feb 4, 2016 at 5:11 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> Yes exactly , it says it is quarantined.
>>>
>>> Gyula
>>>
>>>
>>> Gyula
>>>
>>> On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> @Gyula Do you see log messages about quarantined actor systems?
>>>>
>>>> There may be an issue with Akka Death watches that once the connection
>>>> is lost, it cannot be re-established unless the TaskManager is restarted
>>>>
>>>>
>>>> http://doc.akka.io/docs/akka/current/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model
>>>>
>>>>
>>>>
>>>> On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <radu.tudo...@huawei.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> Well…yesterday when I looked into it there was no additional info than
>>>>> the one I have send. Today I reproduced the problem and I could see in the
>>>>> log file.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> akka.actor.ActorInitializationException: exception during creation
>>>>>
>>>>>         at
>>>>> akka.actor.ActorInitializationException$.apply(Actor.scala:164)
>>>>>
>>>>>         at akka.actor.ActorCell.create(ActorCell.scala:596)
>>>>>
>>>>>         at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
>>>>>
>>>>>         at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>>>>>
>>>>>         at
>>>>> akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
>>>>>
>>>>>         at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>>>
>>>>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>>>>
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
>>>>>
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
>>>>>
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>
>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>
>>>>>         at
>>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>
>>>>>         at
>>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>>>
>>>>>         at
>>>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>
>>>>>         at
>>>>> java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>>>>>
>>>>>         at akka.util.Reflect$.instantiate(Reflect.scala:66)
>>>>>
>>>>>         at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)
>>>>>
>>>>>         at akka.actor.Props.newActor(Props.scala:252)
>>>>>
>>>>>         at akka.actor.ActorCell.newActor(ActorCell.scala:552)
>>>>>
>>>>>         at akka.actor.ActorCell.create(ActorCell.scala:578)
>>>>>
>>>>>         ... 10 more
>>>>>
>>>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>
>>>>> 11:21:17,423 ERROR
>>>>> org.apache.flink.runtime.taskmanager.Task
>>>>> - FATAL - exception in task resource cleanup
>>>>>
>>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>
>>>>> 11:21:55,160 ERROR
>>>>> org.apache.flink.runtime.taskmanager.Task
>>>>>           - FATAL - exception in task exception handler
>>>>>
>>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>
>>>>>
>>>>>
>>>>> ….
>>>>>
>>>>>
>>>>>
>>>>> - Unexpected exception in the selector loop.
>>>>>
>>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Looks like the input flow is faster than the GC collector
>>>>>
>>>>>
>>>>>
>>>>> Dr. Radu Tudoran
>>>>>
>>>>> Research Engineer - Big Data Expert
>>>>>
>>>>> IT R&D Division
>>>>>
>>>>>
>>>>>
>>>>> [image: cid:image007.jpg@01CD52EB.AD060EE0]
>>>>>
>>>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH
>>>>>
>>>>> European Research Center
>>>>>
>>>>> Riesstrasse 25, 80992 München
>>>>>
>>>>>
>>>>>
>>>>> E-mail: *radu.tudo...@huawei.com <radu.tudo...@huawei.com>*
>>>>>
>>>>> Mobile: +49 15209084330
>>>>>
>>>>> Telephone: +49 891588344173
>>>>>
>>>>>
>>>>>
>>>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH
>>>>> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
>>>>> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
>>>>> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
>>>>> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
>>>>> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN
>>>>>
>>>>> This e-mail and its attachments contain confidential information from
>>>>> HUAWEI, which is intended only for the person or entity whose address is
>>>>> listed above. Any use of the information contained herein in any way
>>>>> (including, but not limited to, total or partial disclosure, reproduction,
>>>>> or dissemination) by persons other than the intended recipient(s) is
>>>>> prohibited. If you receive this e-mail in error, please notify the sender
>>>>> by phone or email immediately and delete it!
>>>>>
>>>>>
>>>>>
>>>>> *From:* Till Rohrmann [mailto:trohrm...@apache.org]
>>>>> *Sent:* Thursday, February 04, 2016 4:55 PM
>>>>> *To:* user@flink.apache.org
>>>>> *Subject:* Re: release of task slot
>>>>>
>>>>>
>>>>>
>>>>> Hi Radu,
>>>>>
>>>>> what does the log of the TaskManager 10.204.62.80:57910 say?
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> 
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <radu.tudo...@huawei.com>
>>>>> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I am facing an error which for which I cannot figure the cause. Any
>>>>> idea what could cause such an error?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> java.lang.Exception: The slot in which the task was executed has been
>>>>> released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @
>>>>> cloudr6-admin - 4 slots - URL: akka.tcp://
>>>>> flink@10.204.62.80:57910/user/taskmanager
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)
>>>>>
>>>>>         at
>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>>>
>>>>>         at
>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>>>
>>>>>         at
>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>>>>
>>>>>         at
>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>>>
>>>>>         at
>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>>>
>>>>>         at
>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>>>
>>>>>         at
>>>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>>>
>>>>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>>>>
>>>>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>>>
>>>>>         at
>>>>> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>>>
>>>>>         at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>>>>
>>>>>         at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>>>>
>>>>>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>>>>
>>>>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>>>>
>>>>>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>>>>
>>>>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>>>>
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr. Radu Tudoran
>>>>>
>>>>> Research Engineer - Big Data Expert
>>>>>
>>>>> IT R&D Division
>>>>>
>>>>>
>>>>>
>>>>> [image: cid:image007.jpg@01CD52EB.AD060EE0]
>>>>>
>>>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH
>>>>>
>>>>> European Research Center
>>>>>
>>>>> Riesstrasse 25, 80992 München
>>>>>
>>>>>
>>>>>
>>>>> E-mail: *radu.tudo...@huawei.com <radu.tudo...@huawei.com>*
>>>>>
>>>>> Mobile: +49 15209084330
>>>>>
>>>>> Telephone: +49 891588344173
>>>>>
>>>>>
>>>>>
>>>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH
>>>>> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
>>>>> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
>>>>> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
>>>>> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
>>>>> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN
>>>>>
>>>>> This e-mail and its attachments contain confidential information from
>>>>> HUAWEI, which is intended only for the person or entity whose address is
>>>>> listed above. Any use of the information contained herein in any way
>>>>> (including, but not limited to, total or partial disclosure, reproduction,
>>>>> or dissemination) by persons other than the intended recipient(s) is
>>>>> prohibited. If you receive this e-mail in error, please notify the sender
>>>>> by phone or email immediately and delete it!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>

Re: release of task slot

Reply via email to