Re: release of task slot
Hey, I am actually facing a similar issue lately, where the job manager release the task slots as it cannot contact the taskmanager. Meanwhile the taskmanager is also trying to connect to the Jobmanager and fails multiple times. This happens on multiple taskmanagers seemingly randomly. So the TM stays alive but the connection is lost. Maybe these are related. We are currently trying to debug this problem. Gyula Till Rohrmannezt írta (időpont: 2016. febr. 4., Cs, 15:55): > Hi Radu, > > what does the log of the TaskManager 10.204.62.80:57910 say? > > Cheers, > Till > > > On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran > wrote: > >> Hello, >> >> >> >> >> >> I am facing an error which for which I cannot figure the cause. Any idea >> what could cause such an error? >> >> >> >> >> >> >> >> java.lang.Exception: The slot in which the task was executed has been >> released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ >> cloudr6-admin - 4 slots - URL: akka.tcp:// >> flink@10.204.62.80:57910/user/taskmanager >> >> at >> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) >> >> at >> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) >> >> at >> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) >> >> at >> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) >> >> at >> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) >> >> at >> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696) >> >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >> >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >> >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >> >> at >> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >> >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >> >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >> >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >> >> at >> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >> >> at >> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >> >> at >> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >> >> at >> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >> >> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >> >> at >> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >> >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >> >> at >> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) >> >> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) >> >> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) >> >> at akka.actor.ActorCell.invoke(ActorCell.scala:486) >> >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >> >> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >> >> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >> >> at >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> >> at >> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >> >> at >> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> >> at >> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> >> >> >> >> >> Dr. Radu Tudoran >> >> Research Engineer - Big Data Expert >> >> IT R Division >> >> >> >> [image: image001.png] >> >> HUAWEI TECHNOLOGIES Duesseldorf GmbH >> >> European Research Center >> >> Riesstrasse 25, 80992 München >> >> >> >> E-mail: *radu.tudo...@huawei.com * >> >> Mobile: +49 15209084330 >> >> Telephone: +49 891588344173 >> >> >> >> HUAWEI TECHNOLOGIES Duesseldorf GmbH >> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com >> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, >> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN >> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, >> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN >> >> This e-mail and its attachments contain confidential information from >> HUAWEI, which is intended only for the person or entity whose address is >> listed above. Any use of the
Re: release of task slot
Hi Radu, what does the log of the TaskManager 10.204.62.80:57910 say? Cheers, Till On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoranwrote: > Hello, > > > > > > I am facing an error which for which I cannot figure the cause. Any idea > what could cause such an error? > > > > > > > > java.lang.Exception: The slot in which the task was executed has been > released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ > cloudr6-admin - 4 slots - URL: akka.tcp:// > flink@10.204.62.80:57910/user/taskmanager > > at > org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) > > at > org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) > > at > org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) > > at > org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) > > at > org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) > > at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696) > > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > > at > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) > > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > > at > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > > at > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > > at > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) > > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > > at > akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) > > at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) > > at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) > > at akka.actor.ActorCell.invoke(ActorCell.scala:486) > > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > > > > > > Dr. Radu Tudoran > > Research Engineer - Big Data Expert > > IT R Division > > > > [image: cid:image007.jpg@01CD52EB.AD060EE0] > > HUAWEI TECHNOLOGIES Duesseldorf GmbH > > European Research Center > > Riesstrasse 25, 80992 München > > > > E-mail: *radu.tudo...@huawei.com * > > Mobile: +49 15209084330 > > Telephone: +49 891588344173 > > > > HUAWEI TECHNOLOGIES Duesseldorf GmbH > Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com > Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, > Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN > Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, > Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN > > This e-mail and its attachments contain confidential information from > HUAWEI, which is intended only for the person or entity whose address is > listed above. Any use of the information contained herein in any way > (including, but not limited to, total or partial disclosure, reproduction, > or dissemination) by persons other than the intended recipient(s) is > prohibited. If you receive this e-mail in error, please notify the sender > by phone or email immediately and delete it! > > >
Re: release of task slot
Yes exactly , it says it is quarantined. Gyula Gyula On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <se...@apache.org> wrote: > @Gyula Do you see log messages about quarantined actor systems? > > There may be an issue with Akka Death watches that once the connection is > lost, it cannot be re-established unless the TaskManager is restarted > > > http://doc.akka.io/docs/akka/current/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model > > > > On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <radu.tudo...@huawei.com> > wrote: > >> Hi, >> >> >> >> Well…yesterday when I looked into it there was no additional info than >> the one I have send. Today I reproduced the problem and I could see in the >> log file. >> >> >> >> >> >> akka.actor.ActorInitializationException: exception during creation >> >> at akka.actor.ActorInitializationException$.apply(Actor.scala:164) >> >> at akka.actor.ActorCell.create(ActorCell.scala:596) >> >> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) >> >> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) >> >> at >> akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279) >> >> at akka.dispatch.Mailbox.run(Mailbox.scala:220) >> >> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >> >> at >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> >> at >> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >> >> at >> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >> >> at >> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> >> at >> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> >> Caused by: java.lang.reflect.InvocationTargetException >> >> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native >> Method) >> >> at >> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) >> >> at >> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >> >> at java.lang.reflect.Constructor.newInstance(Constructor.java:422) >> >> at akka.util.Reflect$.instantiate(Reflect.scala:66) >> >> at akka.actor.ArgsReflectConstructor.produce(Props.scala:352) >> >> at akka.actor.Props.newActor(Props.scala:252) >> >> at akka.actor.ActorCell.newActor(ActorCell.scala:552) >> >> at akka.actor.ActorCell.create(ActorCell.scala:578) >> >> ... 10 more >> >> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded >> >> 11:21:17,423 ERROR >> org.apache.flink.runtime.taskmanager.Task >> - FATAL - exception in task resource cleanup >> >> java.lang.OutOfMemoryError: GC overhead limit exceeded >> >> 11:21:55,160 ERROR >> org.apache.flink.runtime.taskmanager.Task >> - FATAL - exception in task exception handler >> >> java.lang.OutOfMemoryError: GC overhead limit exceeded >> >> >> >> …. >> >> >> >> - Unexpected exception in the selector loop. >> >> java.lang.OutOfMemoryError: GC overhead limit exceeded >> >> >> >> >> >> Looks like the input flow is faster than the GC collector >> >> >> >> Dr. Radu Tudoran >> >> Research Engineer - Big Data Expert >> >> IT R Division >> >> >> >> [image: cid:image007.jpg@01CD52EB.AD060EE0] >> >> HUAWEI TECHNOLOGIES Duesseldorf GmbH >> >> European Research Center >> >> Riesstrasse 25, 80992 München >> >> >> >> E-mail: *radu.tudo...@huawei.com <radu.tudo...@huawei.com>* >> >> Mobile: +49 15209084330 >> >> Telephone: +49 891588344173 >> >> >> >> HUAWEI TECHNOLOGIES Duesseldorf GmbH >> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com >> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, >> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN >> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, >> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN >> >> This e-mail and its attachments contain confidential information from >> HUAWEI, which i
RE: release of task slot
Hi, Well…yesterday when I looked into it there was no additional info than the one I have send. Today I reproduced the problem and I could see in the log file. akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.Mailbox.exec(Mailbox.scala:231) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at akka.util.Reflect$.instantiate(Reflect.scala:66) at akka.actor.ArgsReflectConstructor.produce(Props.scala:352) at akka.actor.Props.newActor(Props.scala:252) at akka.actor.ActorCell.newActor(ActorCell.scala:552) at akka.actor.ActorCell.create(ActorCell.scala:578) ... 10 more Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded 11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task - FATAL - exception in task resource cleanup java.lang.OutOfMemoryError: GC overhead limit exceeded 11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task - FATAL - exception in task exception handler java.lang.OutOfMemoryError: GC overhead limit exceeded …. - Unexpected exception in the selector loop. java.lang.OutOfMemoryError: GC overhead limit exceeded Looks like the input flow is faster than the GC collector Dr. Radu Tudoran Research Engineer - Big Data Expert IT R Division [cid:image007.jpg@01CD52EB.AD060EE0] HUAWEI TECHNOLOGIES Duesseldorf GmbH European Research Center Riesstrasse 25, 80992 München E-mail: radu.tudo...@huawei.com Mobile: +49 15209084330 Telephone: +49 891588344173 HUAWEI TECHNOLOGIES Duesseldorf GmbH Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com<http://www.huawei.com/> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! From: Till Rohrmann [mailto:trohrm...@apache.org] Sent: Thursday, February 04, 2016 4:55 PM To: user@flink.apache.org Subject: Re: release of task slot Hi Radu, what does the log of the TaskManager 10.204.62.80:57910<http://10.204.62.80:57910> say? Cheers, Till On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <radu.tudo...@huawei.com<mailto:radu.tudo...@huawei.com>> wrote: Hello, I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error? java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager<http://flink@10.204.62.80:57910/user/taskmanager> at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) at org.apache.flink.runtime.instance.SharedSlot.r
Re: release of task slot
>>>> >>>> 11:21:17,423 ERROR >>>> org.apache.flink.runtime.taskmanager.Task >>>> - FATAL - exception in task resource cleanup >>>> >>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>> >>>> 11:21:55,160 ERROR >>>> org.apache.flink.runtime.taskmanager.Task >>>> - FATAL - exception in task exception handler >>>> >>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>> >>>> >>>> >>>> …. >>>> >>>> >>>> >>>> - Unexpected exception in the selector loop. >>>> >>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>> >>>> >>>> >>>> >>>> >>>> Looks like the input flow is faster than the GC collector >>>> >>>> >>>> >>>> Dr. Radu Tudoran >>>> >>>> Research Engineer - Big Data Expert >>>> >>>> IT R Division >>>> >>>> >>>> >>>> [image: cid:image007.jpg@01CD52EB.AD060EE0] >>>> >>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH >>>> >>>> European Research Center >>>> >>>> Riesstrasse 25, 80992 München >>>> >>>> >>>> >>>> E-mail: *radu.tudo...@huawei.com <radu.tudo...@huawei.com>* >>>> >>>> Mobile: +49 15209084330 >>>> >>>> Telephone: +49 891588344173 >>>> >>>> >>>> >>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH >>>> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com >>>> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, >>>> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN >>>> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, >>>> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN >>>> >>>> This e-mail and its attachments contain confidential information from >>>> HUAWEI, which is intended only for the person or entity whose address is >>>> listed above. Any use of the information contained herein in any way >>>> (including, but not limited to, total or partial disclosure, reproduction, >>>> or dissemination) by persons other than the intended recipient(s) is >>>> prohibited. If you receive this e-mail in error, please notify the sender >>>> by phone or email immediately and delete it! >>>> >>>> >>>> >>>> *From:* Till Rohrmann [mailto:trohrm...@apache.org] >>>> *Sent:* Thursday, February 04, 2016 4:55 PM >>>> *To:* user@flink.apache.org >>>> *Subject:* Re: release of task slot >>>> >>>> >>>> >>>> Hi Radu, >>>> >>>> what does the log of the TaskManager 10.204.62.80:57910 say? >>>> >>>> Cheers, >>>> Till >>>> >>>> >>>> >>>> >>>> >>>> On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <radu.tudo...@huawei.com> >>>> wrote: >>>> >>>> Hello, >>>> >>>> >>>> >>>> >>>> >>>> I am facing an error which for which I cannot figure the cause. Any >>>> idea what could cause such an error? >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> java.lang.Exception: The slot in which the task was executed has been >>>> released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ >>>> cloudr6-admin - 4 slots - URL: akka.tcp:// >>>> flink@10.204.62.80:57910/user/taskmanager >>>> >>>> at >>>> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) >>>> >>>> at >>>> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) >>>> >>>> at >>>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) >>>> >>>> at >>>> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) >>>> >>>> at >>>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
Re: release of task slot
@Gyula Do you see log messages about quarantined actor systems? There may be an issue with Akka Death watches that once the connection is lost, it cannot be re-established unless the TaskManager is restarted http://doc.akka.io/docs/akka/current/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <radu.tudo...@huawei.com> wrote: > Hi, > > > > Well…yesterday when I looked into it there was no additional info than the > one I have send. Today I reproduced the problem and I could see in the log > file. > > > > > > akka.actor.ActorInitializationException: exception during creation > > at akka.actor.ActorInitializationException$.apply(Actor.scala:164) > > at akka.actor.ActorCell.create(ActorCell.scala:596) > > at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) > > at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) > > at > akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279) > > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) > > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) > > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > > Caused by: java.lang.reflect.InvocationTargetException > > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > > at akka.util.Reflect$.instantiate(Reflect.scala:66) > > at akka.actor.ArgsReflectConstructor.produce(Props.scala:352) > > at akka.actor.Props.newActor(Props.scala:252) > > at akka.actor.ActorCell.newActor(ActorCell.scala:552) > > at akka.actor.ActorCell.create(ActorCell.scala:578) > > ... 10 more > > Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded > > 11:21:17,423 ERROR > org.apache.flink.runtime.taskmanager.Task > - FATAL - exception in task resource cleanup > > java.lang.OutOfMemoryError: GC overhead limit exceeded > > 11:21:55,160 ERROR > org.apache.flink.runtime.taskmanager.Task > - FATAL - exception in task exception handler > > java.lang.OutOfMemoryError: GC overhead limit exceeded > > > > …. > > > > - Unexpected exception in the selector loop. > > java.lang.OutOfMemoryError: GC overhead limit exceeded > > > > > > Looks like the input flow is faster than the GC collector > > > > Dr. Radu Tudoran > > Research Engineer - Big Data Expert > > IT R Division > > > > [image: cid:image007.jpg@01CD52EB.AD060EE0] > > HUAWEI TECHNOLOGIES Duesseldorf GmbH > > European Research Center > > Riesstrasse 25, 80992 München > > > > E-mail: *radu.tudo...@huawei.com <radu.tudo...@huawei.com>* > > Mobile: +49 15209084330 > > Telephone: +49 891588344173 > > > > HUAWEI TECHNOLOGIES Duesseldorf GmbH > Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com > Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, > Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN > Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, > Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN > > This e-mail and its attachments contain confidential information from > HUAWEI, which is intended only for the person or entity whose address is > listed above. Any use of the information contained herein in any way > (including, but not limited to, total or partial disclosure, reproduction, > or dissemination) by persons other than the intended recipient(s) is > prohibited. If you receive this e-mail in error, please notify the sender > by phone or email immediately and delete it! > > > > *From:* Till Rohrmann [mailto:trohrm...@apache.org] > *Sent:* Thursday, February 04, 2016 4:55 PM > *To:* user@flink.apache.org > *Subject:* Re: release of task slot > > > > Hi Radu, > > what does the log of the TaskManager 10.204.62.80:57910 say? > > Cheers, &
Re: release of task slot
t;> >>> >>> Looks like the input flow is faster than the GC collector >>> >>> >>> >>> Dr. Radu Tudoran >>> >>> Research Engineer - Big Data Expert >>> >>> IT R Division >>> >>> >>> >>> [image: cid:image007.jpg@01CD52EB.AD060EE0] >>> >>> HUAWEI TECHNOLOGIES Duesseldorf GmbH >>> >>> European Research Center >>> >>> Riesstrasse 25, 80992 München >>> >>> >>> >>> E-mail: *radu.tudo...@huawei.com <radu.tudo...@huawei.com>* >>> >>> Mobile: +49 15209084330 >>> >>> Telephone: +49 891588344173 >>> >>> >>> >>> HUAWEI TECHNOLOGIES Duesseldorf GmbH >>> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com >>> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, >>> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN >>> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, >>> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN >>> >>> This e-mail and its attachments contain confidential information from >>> HUAWEI, which is intended only for the person or entity whose address is >>> listed above. Any use of the information contained herein in any way >>> (including, but not limited to, total or partial disclosure, reproduction, >>> or dissemination) by persons other than the intended recipient(s) is >>> prohibited. If you receive this e-mail in error, please notify the sender >>> by phone or email immediately and delete it! >>> >>> >>> >>> *From:* Till Rohrmann [mailto:trohrm...@apache.org] >>> *Sent:* Thursday, February 04, 2016 4:55 PM >>> *To:* user@flink.apache.org >>> *Subject:* Re: release of task slot >>> >>> >>> >>> Hi Radu, >>> >>> what does the log of the TaskManager 10.204.62.80:57910 say? >>> >>> Cheers, >>> Till >>> >>> >>> >>> >>> >>> On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <radu.tudo...@huawei.com> >>> wrote: >>> >>> Hello, >>> >>> >>> >>> >>> >>> I am facing an error which for which I cannot figure the cause. Any idea >>> what could cause such an error? >>> >>> >>> >>> >>> >>> >>> >>> java.lang.Exception: The slot in which the task was executed has been >>> released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ >>> cloudr6-admin - 4 slots - URL: akka.tcp:// >>> flink@10.204.62.80:57910/user/taskmanager >>> >>> at >>> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) >>> >>> at >>> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) >>> >>> at >>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) >>> >>> at >>> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) >>> >>> at >>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) >>> >>> at >>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696) >>> >>> at >>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>> >>> at >>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>> >>> at >>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>> >>> at >>> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >>> >>> at >>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>> >>> at >>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>> >>> at >>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>> >>> at >>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >>