Please see the last comment on this issue: https://github.com/facebook/rocksdb/issues/3216
FYI

On Tue, Apr 10, 2018 at 12:25 AM, Lasse Nedergaard <lassenederga...@gmail.com> wrote:

> This graph shows Non-Heap. If the same pattern exists, it makes sense that it would try to allocate more memory and then exceed the limit. I can see the same trend for all the other containers that have been killed. So my question is now: what is using non-heap memory?
> From http://mail-archives.apache.org/mod_mbox/flink-user/201707.mbox/%3CCANC1h_u0dQQvbysDAoLLbEmeWaxiimTMFjJCribpfpo0idl...@mail.gmail.com%3E it looks like RocksDB could be the culprit.
>
> I have jobs using incremental checkpointing and some without, some optimised for FLASH_SSD; all show the same pattern.
>
> Lasse
>
> 2018-04-10 8:52 GMT+02:00 Lasse Nedergaard <lassenederga...@gmail.com>:
>
>> Hi.
>>
>> I found the exception attached below for our simple job. It states that our task manager was killed due to exceeding the memory limit of 2.7 GB. But when I look at the Flink metrics just 30 seconds before, it was using 1.3 GB heap and 712 MB non-heap, around 2 GB in total. So something else is also using memory inside the container; any idea how to figure out what?
>>
>> As a side note, we use RocksDBStateBackend with this configuration:
>>
>> env.getCheckpointConfig().setMinPauseBetweenCheckpoints((long)(config.checkPointInterval * 0.75));
>> env.enableCheckpointing(config.checkPointInterval, CheckpointingMode.AT_LEAST_ONCE);
>> env.setStateBackend(new RocksDBStateBackend(config.checkpointDataUri));
>>
>> where checkpointDataUri points to S3.
>>
>> Lasse Nedergaard
>>
>> 2018-04-09 16:52:01,239 INFO org.apache.flink.yarn.YarnFlinkResourceManager - Diagnostics for container container_1522921976871_0001_01_000079 in state COMPLETE : exitStatus=Pmem limit exceeded (-104) diagnostics=Container [pid=30118,containerID=container_1522921976871_0001_01_000079] is running beyond physical memory limits. Current usage: 2.7 GB of 2.7 GB physical memory used; 4.9 GB of 13.4 GB virtual memory used. Killing container.
>>
>> Dump of the process-tree for container_1522921976871_0001_01_000079 :
>>
>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>>
>> |- 30136 30118 30118 30118 (java) 245173 68463 5193723904 703845 /usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m -Dlog.file=/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_000079/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskManager --configDir .
>>
>> |- 30118 30116 30118 30118 (bash) 0 0 115818496 674 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Xms2063m -Xmx2063m -Dlog.file=/var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_000079/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskManager --configDir . 1> /var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_000079/taskmanager.out 2> /var/log/hadoop-yarn/containers/application_1522921976871_0001/container_1522921976871_0001_01_000079/taskmanager.err
>>
>> 2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter - gauge.ip-10-1-1-181.taskmanager.container_1522921976871_0001_01_000079.Status.JVM.Memory.Heap.Used=1398739496
>>
>> 2018-04-09 16:51:26,659 DEBUG org.trackunit.tm2.LogReporter - gauge.ip-10-1-1-181.taskmanager.container_1522921976871_0001_01_000079.Status.JVM.Memory.NonHeap.Used=746869520
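A note on the gap in those numbers: the JVM Heap.Used and NonHeap.Used gauges do not cover RocksDB, which allocates its write buffers and block cache in native memory outside the JVM, so roughly 2 GB reported against a 2.7 GB kill is consistent with RocksDB making up the difference. One way to bound that native footprint is to hand the backend an OptionsFactory; a minimal sketch against the Flink 1.4-era API, where the sizes are illustrative assumptions, not recommendations:

    import org.apache.flink.contrib.streaming.state.OptionsFactory;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.ColumnFamilyOptions;
    import org.rocksdb.DBOptions;

    RocksDBStateBackend backend = new RocksDBStateBackend(config.checkpointDataUri);
    backend.setOptions(new OptionsFactory() {
        @Override
        public DBOptions createDBOptions(DBOptions currentOptions) {
            // fewer background compactions means fewer concurrent native buffers
            return currentOptions.setMaxBackgroundCompactions(2);
        }

        @Override
        public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
            // write buffers and block cache are allocated off-heap per column family
            // (i.e. per state), so these sizes multiply across the stateful operators
            // running on one TaskManager
            return currentOptions
                    .setWriteBufferSize(32 * 1024 * 1024)  // 32 MB memtable
                    .setMaxWriteBufferNumber(2)
                    .setTableFormatConfig(
                            new BlockBasedTableConfig().setBlockCacheSize(64 * 1024 * 1024));
        }
    });
    env.setStateBackend(backend);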
>> 2018-04-09 23:52 GMT+02:00 Ken Krugler <kkrugler_li...@transpac.com>:
>>
>>> Hi Chesnay,
>>>
>>> Don't know if this helps, but I'd run into this as well, though I haven't hooked up YourKit to analyze exactly what's causing the memory problem.
>>>
>>> E.g. after about 3.5 hours running locally, it failed with memory issues.
>>>
>>> In the TaskManager logs, I start seeing exceptions in my code:
>>>
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> And then eventually...
>>>
>>> 2018-04-07 21:55:25,686 WARN org.apache.flink.runtime.accumulators.AccumulatorRegistry - Failed to serialize accumulators for task.
>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> Immediately after this, one of my custom functions gets a close() call, and I see a log msg about it "switched from RUNNING to FAILED".
>>>
>>> After this, I see messages that the job is being restarted, but the TaskManager log output abruptly ends.
>>>
>>> In the JobManager log, this is what is output following the time of the last TaskManager logging output:
>>>
>>> 2018-04-07 21:57:33,702 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 129 @ 1523163453702
>>> 2018-04-07 21:58:43,916 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>>> 2018-04-07 21:58:51,084 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780
>>> 2018-04-07 21:58:51,086 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780]
>>> [the same NettyTransport and ReliableDeliverySupervisor warnings repeat at 21:59:01, 21:59:11 and 21:59:21]
>>> 2018-04-07 21:59:21,056 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@kens-mbp.hsd1.ca.comcast.net:63780]
>>> 2018-04-07 21:59:21,063 INFO org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@kens-mbp.hsd1.ca.comcast.net:63780/user/taskmanager terminated.
>>> 2018-04-07 21:59:21,064 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - FetchUrlsFunction for sitemap -> ParseSiteMapFunction -> OutlinkToStateUrlFunction (1/1) (3e9374d1bf5fdb359e3a624a4d5d659b) switched from RUNNING to FAILED.
>>> java.lang.Exception: TaskManager was lost/killed: c51d3879b6244828eb9fc78c943007ad @ kens-mbp.hsd1.ca.comcast.net (dataPort=63782)
>>>
>>> — Ken
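"GC overhead limit exceeded" means the collector is running almost continuously while reclaiming almost nothing, i.e. the heap is close to full of live objects. Short of attaching YourKit, the stock JDK tools can narrow down what is holding it; a sketch assuming a JDK on the TaskManager host, with <taskmanager-pid> as a placeholder for the TaskManager JVM's process id:

    # histogram of live objects, biggest consumers first
    jmap -histo:live <taskmanager-pid> | head -30

    # GC utilisation sampled once per second; old generation (O) pinned near
    # 100% with a climbing full-GC count (FGC) matches this failure mode
    jstat -gcutil <taskmanager-pid> 1000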
>>> On Apr 9, 2018, at 12:48 PM, Chesnay Schepler <ches...@apache.org> wrote:
>>>
>>> We will need more information to offer any solution. The exception simply means that a TaskManager shut down, for which there are a myriad of possible explanations.
>>>
>>> Please have a look at the TaskManager logs; they may contain a hint as to why it shut down.
>>>
>>> On 09.04.2018 16:01, Javier Lopez wrote:
>>>
>>> Hi,
>>>
>>> "are you moving the job jar to the ~/flink-1.4.2/lib path?" -> Yes, to every node in the cluster.
>>>
>>> On 9 April 2018 at 15:37, miki haiat <miko5...@gmail.com> wrote:
>>>
>>>> Javier,
>>>> "adding the jar file to the /lib path of every task manager"
>>>> Are you moving the job jar to the ~/flink-1.4.2/lib path?
>>>>
>>>> On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez <javier.lo...@zalando.de> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We had the same metaspace problem; it was solved by adding the jar file to the /lib path of every task manager, as explained here: https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading. As well, we added these Java options: "-XX:CompressedClassSpaceSize=100M -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"
>>>>>
>>>>> From time to time we have the same problem with TaskManagers disconnecting, but the logs are not useful. We are using 1.3.2.
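For reference, those flags can also be applied cluster-wide through flink-conf.yaml instead of per job; a sketch assuming the env.java.opts key (the values below are the ones Javier quoted, not tuned recommendations):

    # flink-conf.yaml
    env.java.opts: "-XX:CompressedClassSpaceSize=100M -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"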
>>>>> On 9 April 2018 at 10:41, Alexander Smirnov <alexander.smirn...@gmail.com> wrote:
>>>>>
>>>>>> I've seen a similar problem, but it was not the heap size, it was Metaspace. It was caused by a job restarting in a loop: it looks like for each restart, Flink loads new instances of the classes, and very soon it runs out of metaspace.
>>>>>>
>>>>>> I've created a JIRA issue for this problem, but got no response from the development team on it: https://issues.apache.org/jira/browse/FLINK-9132
>>>>>>
>>>>>> On Mon, Apr 9, 2018 at 11:36 AM 王凯 <wangka...@163.com> wrote:
>>>>>>
>>>>>>> Thanks a lot, I will try it.
>>>>>>>
>>>>>>> On 2018-04-09 00:06:02, "TechnoMage" <mla...@technomage.com> wrote:
>>>>>>>
>>>>>>> I have seen this when my task manager ran out of RAM. Increase the heap size.
>>>>>>>
>>>>>>> flink-conf.yaml:
>>>>>>> taskmanager.heap.mb
>>>>>>> jobmanager.heap.mb
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> On Apr 8, 2018, at 2:36 AM, 王凯 <wangka...@163.com> wrote:
>>>>>>>
>>>>>>> [attached screenshot: QQ图片20180408163927.png]
>>>>>>> Hi all. Recently I found a problem: the job runs well at first, but after running for a long time it fails with the exception shown above. How can I resolve it?
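Tying the suggestions above together for the YARN case: in this era of Flink, taskmanager.heap.mb on YARN effectively sets the requested container size, and Flink subtracts a safety cutoff from it before choosing -Xmx, so the JVM heap has to share the container with metaspace, RocksDB and the other native allocations discussed in this thread. A sketch of the relevant flink-conf.yaml entries, assuming the Flink 1.4 key names; the numbers are illustrative, not recommendations:

    # flink-conf.yaml
    taskmanager.heap.mb: 4096
    jobmanager.heap.mb: 1024
    # fraction of the container withheld from the JVM heap for off-heap/native
    # memory (RocksDB, metaspace, network buffers); the 1.4 default is 0.25
    containerized.heap-cutoff-ratio: 0.4
    containerized.heap-cutoff-min: 600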