I am not an expert on this but I guess either could work. What is also interesting that sometimes this doesn't happen for a day sometimes it happens twice an hour so it's probably some network issue as well.
Gyula Stephan Ewen <se...@apache.org> ezt írta (időpont: 2016. febr. 4., Cs, 16:32): > We should probably add to the TaskManager a "restart on quarantined" > strategy anyways. > > We can detect it as follows: > http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state > > > On Thu, Feb 4, 2016 at 5:18 PM, Stephan Ewen <se...@apache.org> wrote: > >> Okay, here are the docs for the Akka version we are using: >> http://doc.akka.io/docs/akka/2.3.14/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model >> >> It says that after a remote deathwatch trigger, the actor system must be >> restarted before it can connect again. >> >> We probably need to do the following: >> - Either restart TaskManager actor system when it detects that it is >> quarantined (or when it senses the JobManager as failed) >> >> - Or switch from Akka death watch to a manual heartbeat mechanism >> >> Would be good to also have Till's input on this... >> >> What do you think? >> >> Stephan >> >> >> >> On Thu, Feb 4, 2016 at 5:11 PM, Gyula Fóra <gyula.f...@gmail.com> wrote: >> >>> Yes exactly , it says it is quarantined. >>> >>> Gyula >>> >>> >>> Gyula >>> >>> On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <se...@apache.org> wrote: >>> >>>> @Gyula Do you see log messages about quarantined actor systems? >>>> >>>> There may be an issue with Akka Death watches that once the connection >>>> is lost, it cannot be re-established unless the TaskManager is restarted >>>> >>>> >>>> http://doc.akka.io/docs/akka/current/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model >>>> >>>> >>>> >>>> On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <radu.tudo...@huawei.com> >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> >>>>> >>>>> Well…yesterday when I looked into it there was no additional info than >>>>> the one I have send. Today I reproduced the problem and I could see in the >>>>> log file. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> akka.actor.ActorInitializationException: exception during creation >>>>> >>>>> at >>>>> akka.actor.ActorInitializationException$.apply(Actor.scala:164) >>>>> >>>>> at akka.actor.ActorCell.create(ActorCell.scala:596) >>>>> >>>>> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) >>>>> >>>>> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) >>>>> >>>>> at >>>>> akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279) >>>>> >>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:220) >>>>> >>>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>>>> >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>>> >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >>>>> >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >>>>> >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>>> >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>>> >>>>> Caused by: java.lang.reflect.InvocationTargetException >>>>> >>>>> at >>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) >>>>> >>>>> at >>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) >>>>> >>>>> at >>>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >>>>> >>>>> at >>>>> java.lang.reflect.Constructor.newInstance(Constructor.java:422) >>>>> >>>>> at akka.util.Reflect$.instantiate(Reflect.scala:66) >>>>> >>>>> at akka.actor.ArgsReflectConstructor.produce(Props.scala:352) >>>>> >>>>> at akka.actor.Props.newActor(Props.scala:252) >>>>> >>>>> at akka.actor.ActorCell.newActor(ActorCell.scala:552) >>>>> >>>>> at akka.actor.ActorCell.create(ActorCell.scala:578) >>>>> >>>>> ... 10 more >>>>> >>>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded >>>>> >>>>> 11:21:17,423 ERROR >>>>> org.apache.flink.runtime.taskmanager.Task >>>>> - FATAL - exception in task resource cleanup >>>>> >>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>>> >>>>> 11:21:55,160 ERROR >>>>> org.apache.flink.runtime.taskmanager.Task >>>>> - FATAL - exception in task exception handler >>>>> >>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>>> >>>>> >>>>> >>>>> …. >>>>> >>>>> >>>>> >>>>> - Unexpected exception in the selector loop. >>>>> >>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Looks like the input flow is faster than the GC collector >>>>> >>>>> >>>>> >>>>> Dr. Radu Tudoran >>>>> >>>>> Research Engineer - Big Data Expert >>>>> >>>>> IT R&D Division >>>>> >>>>> >>>>> >>>>> [image: cid:image007.jpg@01CD52EB.AD060EE0] >>>>> >>>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH >>>>> >>>>> European Research Center >>>>> >>>>> Riesstrasse 25, 80992 München >>>>> >>>>> >>>>> >>>>> E-mail: *radu.tudo...@huawei.com <radu.tudo...@huawei.com>* >>>>> >>>>> Mobile: +49 15209084330 >>>>> >>>>> Telephone: +49 891588344173 >>>>> >>>>> >>>>> >>>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH >>>>> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com >>>>> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, >>>>> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN >>>>> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, >>>>> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN >>>>> >>>>> This e-mail and its attachments contain confidential information from >>>>> HUAWEI, which is intended only for the person or entity whose address is >>>>> listed above. Any use of the information contained herein in any way >>>>> (including, but not limited to, total or partial disclosure, reproduction, >>>>> or dissemination) by persons other than the intended recipient(s) is >>>>> prohibited. If you receive this e-mail in error, please notify the sender >>>>> by phone or email immediately and delete it! >>>>> >>>>> >>>>> >>>>> *From:* Till Rohrmann [mailto:trohrm...@apache.org] >>>>> *Sent:* Thursday, February 04, 2016 4:55 PM >>>>> *To:* user@flink.apache.org >>>>> *Subject:* Re: release of task slot >>>>> >>>>> >>>>> >>>>> Hi Radu, >>>>> >>>>> what does the log of the TaskManager 10.204.62.80:57910 say? >>>>> >>>>> Cheers, >>>>> Till >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <radu.tudo...@huawei.com> >>>>> wrote: >>>>> >>>>> Hello, >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> I am facing an error which for which I cannot figure the cause. Any >>>>> idea what could cause such an error? >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> java.lang.Exception: The slot in which the task was executed has been >>>>> released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ >>>>> cloudr6-admin - 4 slots - URL: akka.tcp:// >>>>> flink@10.204.62.80:57910/user/taskmanager >>>>> >>>>> at >>>>> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) >>>>> >>>>> at >>>>> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) >>>>> >>>>> at >>>>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) >>>>> >>>>> at >>>>> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) >>>>> >>>>> at >>>>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) >>>>> >>>>> at >>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696) >>>>> >>>>> at >>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>>>> >>>>> at >>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>>>> >>>>> at >>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>>>> >>>>> at >>>>> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >>>>> >>>>> at >>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>>>> >>>>> at >>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>>>> >>>>> at >>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>>>> >>>>> at >>>>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >>>>> >>>>> at >>>>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >>>>> >>>>> at >>>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >>>>> >>>>> at >>>>> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >>>>> >>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>>>> >>>>> at >>>>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >>>>> >>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>>>> >>>>> at >>>>> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) >>>>> >>>>> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) >>>>> >>>>> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) >>>>> >>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:486) >>>>> >>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>>>> >>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>>>> >>>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>>>> >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>>> >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>>>> >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>>> >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Dr. Radu Tudoran >>>>> >>>>> Research Engineer - Big Data Expert >>>>> >>>>> IT R&D Division >>>>> >>>>> >>>>> >>>>> [image: cid:image007.jpg@01CD52EB.AD060EE0] >>>>> >>>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH >>>>> >>>>> European Research Center >>>>> >>>>> Riesstrasse 25, 80992 München >>>>> >>>>> >>>>> >>>>> E-mail: *radu.tudo...@huawei.com <radu.tudo...@huawei.com>* >>>>> >>>>> Mobile: +49 15209084330 >>>>> >>>>> Telephone: +49 891588344173 >>>>> >>>>> >>>>> >>>>> HUAWEI TECHNOLOGIES Duesseldorf GmbH >>>>> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com >>>>> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063, >>>>> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN >>>>> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063, >>>>> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN >>>>> >>>>> This e-mail and its attachments contain confidential information from >>>>> HUAWEI, which is intended only for the person or entity whose address is >>>>> listed above. Any use of the information contained herein in any way >>>>> (including, but not limited to, total or partial disclosure, reproduction, >>>>> or dissemination) by persons other than the intended recipient(s) is >>>>> prohibited. If you receive this e-mail in error, please notify the sender >>>>> by phone or email immediately and delete it! >>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>>> >> >