[jira] [Commented] (FLINK-20333) Flink standalone cluster throws metaspace OOM after submitting multiple PyFlink UDF jobs.

Flavio Pompermaier (Jira) Wed, 25 Nov 2020 05:45:07 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238754#comment-17238754
 ]


Flavio Pompermaier commented on FLINK-20333:
--------------------------------------------

Could this problem also affect normal Java jobs? I see the same leak in my 
Flink session cluster. This is the leak suspect that Eclipse MAT reports:

 
{code:java}
5,416 instances of "java.lang.Class", loaded by "<system class loader>" occupy 
2,706,048 (11.04%) bytes.
Biggest instances:

class java.io.ObjectStreamClass$Caches @ 0xe0f52f98 - 402,896 (1.64%) bytes. 
{code}
After several job resubmissions I get the following exception:
{code:java}
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has 
occurred. This can mean two things: either the job requires a larger size of 
JVM metaspace to load classes or there is a class loading leak. In the first 
case 'taskmanager.memory.jvm-metaspace.size' configuration option should be 
increased. If the error persists (usually in cluster after several job 
(re-)submissions) then there is probably a class loading leak in user code or 
some of its dependencies which has to be investigated and fixed. The task 
executor has to be shutdown...
        at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
        at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174) ~[?:?]
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:458) ~[?:?]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:452) ~[?:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
        at java.net.URLClassLoader.findClass(URLClassLoader.java:451) ~[?:?]
        at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
        at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) [flink-dist_2.12-1.11.0.jar:1.11.0]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]
{code}
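
To check whether metaspace really grows across resubmissions (a class loading leak) rather than simply being undersized, it may help to sample the JVM's own counters between job runs. The following is only a minimal sketch using the standard java.lang.management API. The class name MetaspaceProbe is hypothetical, the pool name "Metaspace" is an assumption that holds on HotSpot JVMs, and the code would have to run inside the TaskManager JVM (or the same MXBeans be read over JMX) to say anything about the cluster.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Flink: reads metaspace and class loading counters.
public class MetaspaceProbe {

    /** Bytes used by the "Metaspace" memory pool, or -1 if the JVM exposes no such pool. */
    public static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot names the pool "Metaspace"; other JVMs may differ.
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    /** Number of classes currently loaded in this JVM. */
    public static long loadedClassCount() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return bean.getLoadedClassCount();
    }

    /** Total number of classes unloaded since this JVM started. */
    public static long unloadedClassCount() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return bean.getUnloadedClassCount();
    }

    public static void main(String[] args) {
        System.out.printf("metaspace used: %d bytes, loaded classes: %d, unloaded classes: %d%n",
                metaspaceUsedBytes(), loadedClassCount(), unloadedClassCount());
    }
}
```

If the loaded class count and metaspace usage keep climbing after each finished job while the unloaded count stays flat, that would point to user class loaders not being collected, and raising 'taskmanager.memory.jvm-metaspace.size' would only delay the error.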

> Flink standalone cluster throws metaspace OOM after submitting multiple 
> PyFlink UDF jobs.
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-20333
>                 URL: https://issues.apache.org/jira/browse/FLINK-20333
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Python
>            Reporter: Wei Zhong
>            Assignee: Wei Zhong
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.12.0, 1.11.3
>
>
> Currently the Flink standalone cluster will throw metaspace OOM after 
> submitting multiple PyFlink UDF jobs. The root cause is that currently the 
> PyFlink classes are running in user classloader and so each job creates a 
> separate user class loader to load PyFlink related classes. There are many 
> soft references and Finalizers in memory (introduced by the underlying 
> Netty), which prevents the garbage collection of the user classloader of 
> already finished PyFlink jobs. 
> Due to their existence, it needs multiple full gc to reclaim the classloader 
> of the completed job. If only one full gc is performed before the metaspace 
> space is insufficient, then OOM will occur.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
