Hi there,

We have been running Flink 1.11.2 in production with a large-scale setup. The job runs
fine for a couple of days and then ends up in a restart loop caused by YARN
container memory kills. This was not observed when running 1.9.1 with the
same settings.
Here are the JVM options passed to both the 1.11 and 1.9.1 jobs:


env.java.opts.taskmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500
  -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
  -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1
  -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails
  -XX:+PrintGCApplicationStoppedTime -Xloggc:<LOG_DIR>/gc.log'
env.java.opts.jobmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500
  -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
  -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1
  -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails
  -XX:+PrintGCApplicationStoppedTime -Xloggc:<LOG_DIR>/gc.log'

After a preliminary investigation, this does not appear to be related to JVM
heap usage or a GC issue. Meanwhile, we observed that JVM non-heap usage on
some containers keeps rising while the job falls into the restart loop, as
shown below.
[image: chart of JVM non-heap usage on the affected containers]
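
In case it is useful, here is roughly how we plan to break down the non-heap
growth next, using the JVM's native memory tracking (these are standard
HotSpot flags and jcmd commands, nothing Flink-specific; <TM_PID> is just a
placeholder for the task manager process id):

# enable NMT at startup by appending to the existing options
env.java.opts.taskmanager: '... -XX:NativeMemoryTracking=summary'

# then, on the host of an affected container:
jcmd <TM_PID> VM.native_memory baseline
# ... wait while non-heap usage grows ...
jcmd <TM_PID> VM.native_memory summary.diff

The diff should show whether the growth is in Metaspace, Thread, Internal, or
another category.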

From a configuration perspective, we would like to learn how the task
manager handles class loading (and unloading?) when we set include-user-jar
to first. Are there any suggestions on how we can better understand how the
new memory model introduced in 1.10 affects this issue?


cluster.evenly-spread-out-slots: true
zookeeper.sasl.disable: true
yarn.per-job-cluster.include-user-jar: first
yarn.properties-file.location: /usr/local/hadoop/etc/hadoop/


Thanks,
Chen
