[ 
https://issues.apache.org/jira/browse/FLINK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725872#comment-16725872
 ] 

Nawaid Shamim commented on FLINK-11205:
---------------------------------------

[~till.rohrmann] I just replied on FLINK-10317. 
{quote}
I suspect the root cause is a memory leak caused by dynamic class loading. 
Capping Metaspace at a fixed size, or giving it more memory, would only delay 
the OOM. With a Metaspace limit the job still fails with OutOfMemoryError: 
Metaspace, but in that case the task manager dies on its own instead of being 
killed by YARN.

I was able to reproduce the issue on a relatively small setup: one Master and 
one Core. 
* Start 1 Job Manager (JM).
* Start 2 Task Managers - TM1 and TM2. 
* Submit a job with a global parallelism of two so that it is scheduled on 
both TMs. 
* Wait for the job to take its first checkpoint.
* Every 4 minutes:
** Take a heap dump of the JM, TM1, and TM2. 
** Restart the TM2 process. 

On every restart, TM2's JVM / YARN container is restarted. The JM issues 
restart and restore RPCs. TM2 is a new process, while TM1 is the old process 
and reloads duplicate classes (that is where Metaspace explodes). I think it 
has something to do with 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ParentFirstClassLoader#2
{quote}
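
The duplicate-class effect described above can be shown in isolation. The following is only a minimal sketch, not Flink code: it assumes that each job restart creates a new user-code class loader, which this thread suspects but does not confirm. `DuplicateClassDemo`, `UserFunction`, and `IsolatedLoader` are hypothetical names made up for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class DuplicateClassDemo {

    /** Hypothetical stand-in for a user-code class shipped with a job. */
    public static class UserFunction {}

    /** Defines UserFunction itself instead of delegating to a parent loader. */
    static class IsolatedLoader extends ClassLoader {
        private final byte[] classBytes;

        IsolatedLoader(byte[] classBytes) {
            super(null); // bootstrap parent only, so nothing is shared between loaders
            this.classBytes = classBytes;
        }

        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            if (name.equals(UserFunction.class.getName())) {
                return defineClass(name, classBytes, 0, classBytes.length);
            }
            throw new ClassNotFoundException(name);
        }
    }

    /** Reads the compiled bytes of a class from the class path. */
    static byte[] readClassBytes(Class<?> clazz) throws IOException {
        String resource = clazz.getName().replace('.', '/') + ".class";
        try (InputStream in = clazz.getClassLoader().getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("class file not found: " + resource);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Loads UserFunction through a brand-new loader (the assumed restart behaviour). */
    static Class<?> loadInFreshLoader() throws Exception {
        byte[] bytes = readClassBytes(UserFunction.class);
        return new IsolatedLoader(bytes).loadClass(UserFunction.class.getName());
    }

    public static void main(String[] args) throws Exception {
        Class<?> first = loadInFreshLoader();
        Class<?> second = loadInFreshLoader();
        System.out.println("same name:  " + first.getName().equals(second.getName()));
        System.out.println("same class: " + (first == second));
    }
}
```

The two results share a name but are distinct `Class` objects, each with its own class metadata. The JVM can unload a class only once its defining class loader is unreachable, so any lingering reference to an old loader keeps its copy of every class alive.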

> Task Manager Metaspace Memory Leak 
> -----------------------------------
>
>                 Key: FLINK-11205
>                 URL: https://issues.apache.org/jira/browse/FLINK-11205
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.5.5, 1.6.2, 1.7.0
>            Reporter: Nawaid Shamim
>            Priority: Major
>         Attachments: Screenshot 2018-12-18 at 12.14.11.png, Screenshot 
> 2018-12-18 at 15.47.55.png
>
>
> Job restarts cause the task manager to dynamically load duplicate classes. 
> Metaspace is unbounded and grows with every restart. YARN aggressively kills 
> such containers, but the effect is immediately seen on a different task 
> manager, which results in a death spiral.
> The Task Manager uses dynamic class loading as described in 
> [https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html]
> {quote}
> *YARN*
> YARN classloading differs between single job deployments and sessions:
>  * When submitting a Flink job/application directly to YARN (via {{bin/flink 
> run -m yarn-cluster ...}}), dedicated TaskManagers and JobManagers are 
> started for that job. Those JVMs have both Flink framework classes and user 
> code classes in the Java classpath. That means that there is _no dynamic 
> classloading_ involved in that case.
>  * When starting a YARN session, the JobManagers and TaskManagers are started 
> with the Flink framework classes in the classpath. The classes from all jobs 
> that are submitted against the session are loaded dynamically.
> {quote}
> The above is not entirely true, especially when you set {{-yD 
> classloader.resolve-order=parent-first}}. We also observed the above 
> behaviour when submitting a Flink job/application directly to YARN (via 
> {{bin/flink run -m yarn-cluster ...}}).
> !Screenshot 2018-12-18 at 12.14.11.png!
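
One way to track the growth described in the issue without taking a full heap dump is to read the JVM's own management beans. This is a minimal sketch under stated assumptions: the class name `MetaspaceProbe` is hypothetical, the pool name "Metaspace" is the one HotSpot JVMs use, and the code would have to run inside the task manager JVM (or be adapted to a remote JMX connection) to say anything about it.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceProbe {

    /** Bytes used by the pool named "Metaspace" (HotSpot naming), or -1 if not found. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    /** One-line snapshot of Metaspace usage and class loading counters. */
    static String snapshot() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        return String.format(
                "metaspaceUsed=%d loaded=%d totalLoaded=%d unloaded=%d",
                metaspaceUsedBytes(),
                classes.getLoadedClassCount(),
                classes.getTotalLoadedClassCount(),
                classes.getUnloadedClassCount());
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

If snapshots taken after each job restart show `loaded` and `metaspaceUsed` climbing while `unloaded` stays flat, that would be consistent with classes from earlier restarts never being unloaded.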



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
