Re: release of task slot

2016-02-04 Thread Gyula Fóra
Hey,

I have been facing a similar issue lately, where the JobManager releases
the task slots because it cannot contact the TaskManager.

Meanwhile, the TaskManager is also trying to connect to the JobManager and
fails multiple times. This happens on multiple TaskManagers, seemingly at
random. So the TM stays alive, but the connection is lost.

Maybe these are related. We are currently trying to debug this problem.

Gyula

Till Rohrmann wrote (on Thu, Feb 4, 2016, 15:55):

> Hi Radu,
>
> what does the log of the TaskManager 10.204.62.80:57910 say?
>
> Cheers,
> Till
>
> On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran 
> wrote:
>
>> Hello,
>>
>> I am facing an error whose cause I cannot figure out. Any idea what
>> could cause it?
>>
>> java.lang.Exception: The slot in which the task was executed has been
>> released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @
>> cloudr6-admin - 4 slots - URL:
>> akka.tcp://flink@10.204.62.80:57910/user/taskmanager
>>     at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
>>     at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>>     at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>>     at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
>>     at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)
>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>     at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>     at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>     at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>     at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>     at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>     at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>     at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>     at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>     at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>     at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>>
>> Dr. Radu Tudoran
>> Research Engineer - Big Data Expert
>> IT R Division
>>
>> HUAWEI TECHNOLOGIES Duesseldorf GmbH
>> European Research Center
>> Riesstrasse 25, 80992 München
>>
>> E-mail: radu.tudo...@huawei.com
>> Mobile: +49 15209084330
>> Telephone: +49 891588344173
>>
>> HUAWEI TECHNOLOGIES Duesseldorf GmbH
>> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
>> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
>> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
>> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
>> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN
>>
>> This e-mail and its attachments contain confidential information from
>> HUAWEI, which is intended only for the person or entity whose address is
>> listed above. Any use of the 

Re: release of task slot

2016-02-04 Thread Till Rohrmann
Hi Radu,

what does the log of the TaskManager 10.204.62.80:57910 say?

Cheers,
Till


Re: release of task slot

2016-02-04 Thread Gyula Fóra
Yes, exactly: the log says it is quarantined.

Gyula

On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <se...@apache.org> wrote:

> @Gyula Do you see log messages about quarantined actor systems?
>
> There may be an issue with Akka death watches: once the connection is
> lost, it cannot be re-established unless the TaskManager is restarted.
>
> http://doc.akka.io/docs/akka/current/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model
>

RE: release of task slot

2016-02-04 Thread Radu Tudoran
Hi,

Well… yesterday when I looked into it, there was no additional information
beyond what I had sent. Today I reproduced the problem and could see the
following in the log file:


akka.actor.ActorInitializationException: exception during creation
    at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
    at akka.actor.ActorCell.create(ActorCell.scala:596)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at akka.util.Reflect$.instantiate(Reflect.scala:66)
    at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)
    at akka.actor.Props.newActor(Props.scala:252)
    at akka.actor.ActorCell.newActor(ActorCell.scala:552)
    at akka.actor.ActorCell.create(ActorCell.scala:578)
    ... 10 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task - FATAL - exception in task resource cleanup
java.lang.OutOfMemoryError: GC overhead limit exceeded
11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task - FATAL - exception in task exception handler
java.lang.OutOfMemoryError: GC overhead limit exceeded

….

- Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded


It looks like the input arrives faster than the garbage collector can free memory.
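
One way to check this suspicion is to measure how much of the JVM's uptime goes into garbage collection. The sketch below is not taken from the TaskManager or from Flink. It only reads the standard `java.lang.management` beans, and the class name `GcOverheadProbe` and its helper methods are made up for illustration. As far as I know, HotSpot raises "GC overhead limit exceeded" when almost all of the time is spent in GC while very little heap is recovered, so a fraction creeping toward 1.0 would be consistent with the errors above.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Flink: reports the share of JVM uptime spent in GC.
public class GcOverheadProbe {

    /** Fraction of elapsed time spent in GC, clamped to the range [0, 1]. */
    public static double overheadFraction(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        double fraction = (double) Math.max(gcMillis, 0) / elapsedMillis;
        return Math.min(fraction, 1.0);
    }

    /** Sums the accumulated collection time over all collectors of this JVM. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = bean.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        System.out.printf("GC time: %d ms of %d ms uptime (%.1f%%)%n",
                gc, uptime, 100.0 * overheadFraction(gc, uptime));
    }
}
```

To be useful, the same two calls would have to run periodically inside the affected JVM (or be read remotely over JMX); running the class on its own only reports on its own, nearly idle, process.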

Dr. Radu Tudoran


Re: release of task slot

2016-02-04 Thread Stephan Ewen

Re: release of task slot

2016-02-04 Thread Stephan Ewen
@Gyula Do you see log messages about quarantined actor systems?

There may be an issue with Akka death watches: once the connection is
lost, it cannot be re-established unless the TaskManager is restarted.

http://doc.akka.io/docs/akka/current/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model
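
To answer the question above quickly across several TaskManagers, one could scan each log file for quarantine-related lines. This is a minimal sketch, assuming the relevant lines contain some form of the word "quarantine". The exact wording of Akka's messages depends on the version, so the keyword match and the class name `QuarantineLogScan` are assumptions rather than anything taken from this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Flink or Akka: filters log lines that mention quarantining.
public class QuarantineLogScan {

    /** Returns the lines that contain "quarantin" (any case), in their original order. */
    public static List<String> findQuarantineLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (line.toLowerCase(Locale.ROOT).contains("quarantin")) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java QuarantineLogScan <taskmanager-log-file>");
            return;
        }
        // Reads the whole file as UTF-8; fine for a sketch, not for very large logs.
        for (String hit : findQuarantineLines(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(hit);
        }
    }
}
```

An empty result does not rule out the problem, since the keyword is only a guess at what the log contains.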




Re: release of task slot

2016-02-04 Thread Stephan Ewen