We see the same thing running Flink 1.4.2 on YARN, hosted on an AWS EMR cluster. The only thing I can find in the logs is a SIGTERM with exit code 15 or -100. Today our simple job reading from Kinesis and writing to Cassandra was killed. The other day, in another job, I identified a MapState.remove call as the cause of a lost task manager, without any exception. I find it frustrating that it is so hard to find the root cause. If I look at historical metrics for CPU, heap, and non-heap memory, I can't see anything that should cause a problem. So any ideas about how to debug this kind of exception are much appreciated.
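One way to dig further is to pull the aggregated YARN logs and grep for the shutdown signal near the time the container disappeared. A minimal sketch follows; the log lines below are made-up stand-ins for real output, and on the cluster you would first run `yarn logs -applicationId <your-app-id> > app.log` to produce the file:

```shell
# Stand-in log lines; on the cluster, produce app.log with:
#   yarn logs -applicationId <your-app-id> > app.log
cat > app.log <<'EOF'
2018-04-09 12:00:01 INFO  RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2018-04-09 12:00:02 INFO  Container container_e01_0001_01_000002 completed with exit code -100
EOF

# Surface the lines that explain why the TaskManager container went away
grep -n -E 'SIGTERM|exit code' app.log
```

Correlating the timestamps of those lines with the YARN ResourceManager and NodeManager logs can show whether YARN itself killed the container (e.g. for exceeding memory limits) or the node was lost.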
Med venlig hilsen / Best regards
Lasse Nedergaard

> On 9 Apr 2018 at 21.48, Chesnay Schepler <ches...@apache.org> wrote:
>
> We will need more information to offer any solution. The exception simply means that a TaskManager shut down, for which there are a myriad of possible explanations.
>
> Please have a look at the TaskManager logs; they may contain a hint as to why it shut down.
>
>> On 09.04.2018 16:01, Javier Lopez wrote:
>> Hi,
>>
>> "are you moving the job jar to the ~/flink-1.4.2/lib path ?" -> Yes, to every node in the cluster.
>>
>>> On 9 April 2018 at 15:37, miki haiat <miko5...@gmail.com> wrote:
>>> Javier
>>> "adding the jar file to the /lib path of every task manager"
>>> are you moving the job jar to the ~/flink-1.4.2/lib path ?
>>>
>>>> On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez <javier.lo...@zalando.de> wrote:
>>>> Hi,
>>>>
>>>> We had the same metaspace problem; it was solved by adding the jar file to the /lib path of every task manager, as explained here: https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading. We also added these Java options: "-XX:CompressedClassSpaceSize=100M -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"
>>>>
>>>> From time to time we have the same problem with TaskManagers disconnecting, but the logs are not useful. We are using 1.3.2.
>>>>
>>>>> On 9 April 2018 at 10:41, Alexander Smirnov <alexander.smirn...@gmail.com> wrote:
>>>>> I've seen a similar problem, but it was not the heap size, it was Metaspace. It was caused by a job restarting in a loop. It looks like for each restart Flink loads a new instance of the classes, and it very soon runs out of metaspace.
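For reference, the JVM options Javier quotes can be passed to the Flink JVMs via `env.java.opts` in `flink-conf.yaml`; the sizes below are the ones from the message above, not a tuned recommendation:

```yaml
# flink-conf.yaml -- cap metaspace growth on JobManager and TaskManager JVMs
env.java.opts: "-XX:CompressedClassSpaceSize=100M -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"
```

With a hard `MaxMetaspaceSize`, a classloader leak from repeated job restarts fails fast with an OutOfMemoryError in the logs instead of the container being killed silently.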
>>>>>
>>>>> I've created a JIRA issue for this problem, but got no response from the development team on it: https://issues.apache.org/jira/browse/FLINK-9132
>>>>>
>>>>>> On Mon, Apr 9, 2018 at 11:36 AM 王凯 <wangka...@163.com> wrote:
>>>>>> Thanks a lot, I will try it.
>>>>>>
>>>>>> On 2018-04-09 00:06:02, "TechnoMage" <mla...@technomage.com> wrote:
>>>>>> I have seen this when my task manager ran out of RAM. Increase the heap size.
>>>>>>
>>>>>> flink-conf.yaml:
>>>>>> taskmanager.heap.mb
>>>>>> jobmanager.heap.mb
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>> On Apr 8, 2018, at 2:36 AM, 王凯 <wangka...@163.com> wrote:
>>>>>>>
>>>>>>> <QQ图片20180408163927.png>
>>>>>>> Hi all, recently I found a problem: it runs well at start, but after a long run the exception appears as above. How can I resolve it?
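The heap keys Michael mentions take values in megabytes in `flink-conf.yaml` (Flink 1.x); a minimal sketch, with placeholder sizes to be tuned for the actual workload:

```yaml
# flink-conf.yaml -- JVM heap sizes in MB (illustrative values only)
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 4096
```

Note that on YARN the container memory must also be large enough to hold heap plus off-heap (metaspace, direct buffers), or YARN will kill the container regardless of how the heap is sized.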