After the job has been running for 10 days in production, TaskManagers start
failing with:

Connection unexpectedly closed by remote task manager

Looking in the machine logs, I can see the following error:

============= Java processes for user hadoop =============
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fb4f4010000, 1006567424, 0) failed; error='Cannot allocate memory' (err
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 1006567424 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log
=========== End java processes for user hadoop ===========

In addition, the metrics for the TaskManager show very low Heap memory
consumption (20% of Xmx).

Since the heap is mostly free, the memory must be going somewhere off-heap, so
I suspect there is a memory leak in the TaskManager's managed memory.

These are my TaskManager's memory settings (rounded):
flink process 112g
framework.heap.size 0.2g
task.heap.size 50g
managed.size 54g
framework.off-heap.size 0.5g
task.off-heap.size 1g
network 2g
XX:MaxMetaspaceSize 1g

As you can see, the managed memory is 54g, so it's already high (my
managed.fraction is set to 0.5).

I'm running Flink 1.10. Full job details attached.

Can someone advise what would cause a managed memory leak?
--------------------------------------------------------------------------------
Starting YARN TaskExecutor runner (Version: 1.10.0, Rev:<unknown>, Date:<unknown>)
OS current user: yarn
Current Hadoop/Kerberos user: hadoop
JVM: OpenJDK 64-Bit Server VM - Amazon.com Inc. - 1.8/25.252-b09
Maximum heap size: 52224 MiBytes
JAVA_HOME: /etc/alternatives/jre
Hadoop version: 2.8.5-amzn-6
JVM Options:
   -Xmx54760833024
   -Xms54760833024
   -XX:MaxDirectMemorySize=3758096384
   -XX:MaxMetaspaceSize=1073741824
   -XX:+UseG1GC
   -Dlog.file=/var/log/hadoop-yarn/containers/application_1600334141629_0011/container_1600334141629_0011_01_000002/taskmanager.log
   -Dlog4j.configuration=file:./log4j.properties
Program Arguments:
   -D taskmanager.memory.framework.off-heap.size=536870912b
   -D taskmanager.memory.network.max=2147483648b
   -D taskmanager.memory.network.min=2147483648b
   -D taskmanager.memory.framework.heap.size=134217728b
   -D taskmanager.memory.managed.size=58518929408b
   -D taskmanager.cpu.cores=7.0
   -D taskmanager.memory.task.heap.size=54626615296b
   -D taskmanager.memory.task.off-heap.size=1073741824b
   --configDir .
   -Djobmanager.rpc.address=ip-***.us-west-2.compute.internal
   -Dweb.port=0
   -Dweb.tmpdir=/tmp/flink-web-ad601f25-685f-42e5-aa93-9658233031e4
   -Djobmanager.rpc.port=35435
   -Drest.address=ip-***.us-west-2.compute.internal
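
As a sanity check, the launch parameters above can be added up to see how the process budget is split. The following is only a minimal sketch in Python: the byte values are copied from the log, the 112 GiB process size is taken from my rounded summary (so it is an assumption that it is exact), and reading the remainder as JVM overhead is also an assumption.

```python
GIB = 1024 ** 3

# Byte values copied from the "Program Arguments" section of the log above.
components = {
    "framework.heap": 134217728,
    "task.heap": 54626615296,
    "framework.off-heap": 536870912,
    "task.off-heap": 1073741824,
    "network": 2147483648,
    "managed": 58518929408,
}

# Byte values copied from the "JVM Options" section of the log above.
jvm_flags = {
    "Xmx": 54760833024,
    "MaxDirectMemorySize": 3758096384,
    "MaxMetaspaceSize": 1073741824,
}

# Heap is framework heap plus task heap.
heap = components["framework.heap"] + components["task.heap"]

# Direct memory is framework off-heap plus task off-heap plus network buffers.
direct = (
    components["framework.off-heap"]
    + components["task.off-heap"]
    + components["network"]
)

# Everything Flink accounts for itself, excluding metaspace.
total_flink = heap + direct + components["managed"]

# Assumption: the configured process size is exactly 112 GiB.
process_size = 112 * GIB
remainder = process_size - total_flink - jvm_flags["MaxMetaspaceSize"]

print(f"heap        = {heap / GIB:.3f} GiB (matches -Xmx: {heap == jvm_flags['Xmx']})")
print(f"direct      = {direct / GIB:.3f} GiB "
      f"(matches MaxDirectMemorySize: {direct == jvm_flags['MaxDirectMemorySize']})")
print(f"managed     = {components['managed'] / GIB:.3f} GiB "
      f"({components['managed'] / total_flink:.2f} of total Flink memory)")
print(f"total flink = {total_flink / GIB:.3f} GiB")
print(f"remainder   = {remainder / GIB:.3f} GiB (presumably JVM overhead)")
```

The heap and direct sums line up exactly with -Xmx and -XX:MaxDirectMemorySize, and managed memory is exactly half of the total Flink memory, consistent with managed.fraction = 0.5. As far as I understand, managed memory in 1.10 is native memory that neither of those two JVM flags limits, which is why I am looking there first.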
