[ https://issues.apache.org/jira/browse/FLINK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725872#comment-16725872 ]
Nawaid Shamim commented on FLINK-11205: --------------------------------------- [~till.rohrmann] I just replied on FLINK-10317. {quote} I guess the root cause is memory leak due to dynamic loading. Limiting Metaspace to a number or throwing more memory at it would simply delay OOM. Limiting metaspace still causes OutOfMemoryError: Metaspace exception but in this case task manager dies instead of YARN killing it. I was able to reproduce the above issue in relatively smaller setup - One Master and One Core. * Start 1 Job Manager (JM). * Start 2 Task Managers - TM1 and TM2. * Submit job with global parallelism value of two so that both job is scheduled on both TMs. * Wait for job to take first checkpoint. * For every 4 minutes: ** Take heap dump of JB, TM1, TM2. ** Restart TM2 process. On every restart, TM2's JVM / YARN container is restarted. JB issues restart and restore RPC. TM2 is new process while TM1 is old process and will reload duplicate classes (that's where metaspace is exploding). I think it has something to do with org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ParentFirstClassLoader#2 {quote} > Task Manager Metaspace Memory Leak > ----------------------------------- > > Key: FLINK-11205 > URL: https://issues.apache.org/jira/browse/FLINK-11205 > Project: Flink > Issue Type: Bug > Affects Versions: 1.5.5, 1.6.2, 1.7.0 > Reporter: Nawaid Shamim > Priority: Major > Attachments: Screenshot 2018-12-18 at 12.14.11.png, Screenshot > 2018-12-18 at 15.47.55.png > > > Job Restarts causes task manager to dynamically load duplicate classes. > Metaspace is unbounded and grows with every restart. YARN aggressively kill > such containers but this affect is immediately seems on different task > manager which results in death spiral. > Task Manager uses dynamic loader as described in > [https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html] > {quote} > *YARN* > YARN classloading differs between single job deployments and sessions: > * When submitting a Flink job/application directly to YARN (via {{bin/flink > run -m yarn-cluster ...}}), dedicated TaskManagers and JobManagers are > started for that job. Those JVMs have both Flink framework classes and user > code classes in the Java classpath. That means that there is _no dynamic > classloading_ involved in that case. > * When starting a YARN session, the JobManagers and TaskManagers are started > with the Flink framework classes in the classpath. The classes from all jobs > that are submitted against the session are loaded dynamically. > {quote} > The above is not entirely true specially when you set {{-yD > classloader.resolve-order=parent-first}} . We also above observed the above > behaviour when submitting a Flink job/application directly to YARN (via > {{bin/flink run -m yarn-cluster ...}}). > !Screenshot 2018-12-18 at 12.14.11.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)