[ 
https://issues.apache.org/jira/browse/FLINK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888411#comment-16888411
 ] 

Joey Echeverria commented on FLINK-11205:
-----------------------------------------

I didn't get a chance to reproduce this using a sample job, but we found two 
causes of the leak in our jobs.

(1) Apache Commons Logging was caching LogFactory instances. We added the 
following code to our close() methods to release those LogFactories:

{code:java}
      ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
      LogFactory.release(contextLoader);
{code}

(2) We saw a leak in ObjectStreamClass's caches that seemed to have an upper 
limit. This one was trickier, as we had to use reflection to clear the 
caches, again in the close() method:

{code:java}
import java.lang.ref.Reference;
import java.lang.reflect.Field;
import java.util.Iterator;
import java.util.Map;

  // Note: `logger` and `TypeUtils.coerce` are defined elsewhere and are not
  // shown here.

  public static void cleanUpLeakingObjects(ClassLoader contextLoader) {
    try {
      Class<?> caches = Class.forName("java.io.ObjectStreamClass$Caches");
      clearCache(caches, "localDescs", contextLoader);
      clearCache(caches, "reflectors", contextLoader);
    } catch (ReflectiveOperationException | SecurityException | ClassCastException ex) {
      // Clean-up failed
      logger.warn("Cleanup of ObjectStreamClass caches failed with exception {}: {}",
        ex.getClass().getSimpleName(), ex.getMessage());
      logger.debug("Stack trace follows.", ex);
    }
  }

  private static void clearCache(Class<?> caches, String mapName, ClassLoader contextLoader)
    throws ReflectiveOperationException, SecurityException, ClassCastException {
    Field field = caches.getDeclaredField(mapName);
    field.setAccessible(true);

    // The cache is a static field, so no instance is passed to get().
    Map<?, ?> map = TypeUtils.coerce(field.get(null));
    Iterator<?> keys = map.keySet().iterator();
    while (keys.hasNext()) {
      Object key = keys.next();
      if (key instanceof Reference) {
        Object clazz = ((Reference<?>) key).get();
        if (clazz instanceof Class) {
          // Remove the entry if the class was loaded by contextLoader or by
          // any class loader that has contextLoader as an ancestor.
          ClassLoader cl = ((Class<?>) clazz).getClassLoader();
          while (cl != null) {
            if (cl == contextLoader) {
              keys.remove();
              break;
            }
            cl = cl.getParent();
          }
        }
      }
    }
  }
{code}
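
The parent-chain walk inside clearCache() can be looked at in isolation. The following is only a minimal sketch and not part of our production code: the class name LoaderAncestry and the method isSameOrDescendant are hypothetical names chosen for illustration. It shows which entries the loop above would evict: classes whose loader is the given loader or a descendant of it. Classes from the bootstrap loader (reported as null) never match.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical helper, for illustration only.
class LoaderAncestry {

    /** True if start is target, or has target somewhere in its parent chain. */
    static boolean isSameOrDescendant(ClassLoader start, ClassLoader target) {
        for (ClassLoader cl = start; cl != null; cl = cl.getParent()) {
            if (cl == target) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Two empty loaders: "job" has the bootstrap loader as parent,
        // "child" has "job" as parent.
        try (URLClassLoader job = new URLClassLoader(new URL[0], null);
             URLClassLoader child = new URLClassLoader(new URL[0], job)) {
            // A class loaded by a descendant of "job" would be evicted.
            System.out.println(isSameOrDescendant(child, job));
            // The reverse direction does not match.
            System.out.println(isSameOrDescendant(job, child));
            // Bootstrap classes such as String report a null loader.
            System.out.println(isSameOrDescendant(String.class.getClassLoader(), job));
        }
    }
}
```

The check uses reference equality (==) on purpose, as in the code above, because two class loaders of the same type are still distinct loaders.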

> Task Manager Metaspace Memory Leak 
> -----------------------------------
>
>                 Key: FLINK-11205
>                 URL: https://issues.apache.org/jira/browse/FLINK-11205
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.5, 1.6.2, 1.7.0
>            Reporter: Nawaid Shamim
>            Priority: Major
>         Attachments: Screenshot 2018-12-18 at 12.14.11.png, Screenshot 
> 2018-12-18 at 15.47.55.png
>
>
> Job restarts cause the task manager to dynamically load duplicate classes. 
> Metaspace is unbounded and grows with every restart. YARN aggressively kills 
> such containers, but the effect is immediately seen on a different task 
> manager, which results in a death spiral.
> The Task Manager uses dynamic classloading as described in 
> [https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html]
> {quote}
> *YARN*
> YARN classloading differs between single job deployments and sessions:
>  * When submitting a Flink job/application directly to YARN (via {{bin/flink 
> run -m yarn-cluster ...}}), dedicated TaskManagers and JobManagers are 
> started for that job. Those JVMs have both Flink framework classes and user 
> code classes in the Java classpath. That means that there is _no dynamic 
> classloading_ involved in that case.
>  * When starting a YARN session, the JobManagers and TaskManagers are started 
> with the Flink framework classes in the classpath. The classes from all jobs 
> that are submitted against the session are loaded dynamically.
> {quote}
> The above is not entirely true, especially when you set {{-yD 
> classloader.resolve-order=parent-first}}. We also observed the above 
> behaviour when submitting a Flink job/application directly to YARN (via 
> {{bin/flink run -m yarn-cluster ...}}).
> !Screenshot 2018-12-18 at 12.14.11.png!
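
One way to watch the growth described above from inside a JVM is through the standard java.lang.management beans. This is a minimal sketch rather than anything taken from the issue: the class name MetaspaceProbe is hypothetical, and the pool name "Metaspace" is an assumption that holds for HotSpot JVMs from Java 8 onwards but may differ on other JVMs.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical probe, for illustration only.
class MetaspaceProbe {

    /** Used bytes of the pool named "Metaspace", or -1 if no such pool is reported. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot names the pool exactly "Metaspace".
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        System.out.println("metaspace used bytes: " + metaspaceUsedBytes());
        System.out.println("classes loaded now:   " + classes.getLoadedClassCount());
        System.out.println("classes unloaded:     " + classes.getUnloadedClassCount());
    }
}
```

If the loaded-class count keeps rising across job restarts while the unloaded count stays flat, that would be consistent with class loaders from earlier job attempts still being reachable.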



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
