I am seeing an issue where class loaders are not garbage collected and the
job manager eventually fails with a Metaspace OOM.  Here is my setup:

- Flink 1.13.1 on EMR using JDK 8 in session mode
- Job manager is a long-running YARN session
- New jobs are submitted every 5 minutes (and typically run for less than
  5 minutes)

I find that after a few hours the job manager gets killed with a Metaspace
OOM.  I tried increasing the Metaspace size for the job manager, but that
only delays the OOM.

I did some debugging using jcmd and noticed that the size of the loaded
classes keeps increasing.  Next I took a heap dump and found that
instances of org.apache.flink.util.ChildFirstClassLoader are still present
long after the jobs complete.  Checking the GC roots, I found that they are
referenced from java.io.ObjectStreamClass$Caches.  This seems to be the
following JDK issue:
https://bugs.openjdk.java.net/browse/JDK-8277072
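
For anyone who wants to watch the same trend without jcmd, here is a
minimal sketch that reads the JVM's class loading counters in-process.  The
class name ClassLoadingSnapshot is made up for illustration, and the
"Metaspace" pool name assumes a HotSpot JVM.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; not part of Flink.
class ClassLoadingSnapshot {

    /** Returns a one-line summary of class loading counters and Metaspace usage. */
    static String snapshot() {
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        StringBuilder sb = new StringBuilder();
        sb.append("loaded=").append(classLoading.getLoadedClassCount())
          .append(" totalLoaded=").append(classLoading.getTotalLoadedClassCount())
          .append(" unloaded=").append(classLoading.getUnloadedClassCount());

        // Assumption: the pool is named "Metaspace", as on HotSpot JVMs.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if ("Metaspace".equals(pool.getName()) && usage != null) {
                sb.append(" metaspaceUsedBytes=").append(usage.getUsed());
            }
        }
        return sb.toString();
    }
}
```

Logging this periodically should show "loaded" climbing while "unloaded"
stays flat if class loaders are being retained.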

Are there any known workarounds for this situation?
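
One idea, offered only as an untested sketch: since the references live in
static maps inside ObjectStreamClass$Caches, those maps could be cleared
reflectively after a job finishes.  This assumes the JDK 8 field names
localDescs and reflectors and relies on JDK internals.
ObjectStreamCacheCleaner is a hypothetical helper, not a Flink API.  On
newer JDKs the fields may be inaccessible or of a different type, in which
case the sketch does nothing.

```java
import java.lang.reflect.Field;
import java.util.Map;

// Hypothetical helper for illustration; not part of Flink or the JDK.
class ObjectStreamCacheCleaner {

    /**
     * Tries to clear the static caches in java.io.ObjectStreamClass$Caches.
     * Returns the number of caches cleared (0 if the internals are
     * inaccessible or do not match the assumed JDK 8 layout).
     */
    static int clearCaches() {
        int cleared = 0;
        try {
            Class<?> caches = Class.forName("java.io.ObjectStreamClass$Caches");
            // Assumption: JDK 8 field names; later JDKs may differ.
            for (String name : new String[] {"localDescs", "reflectors"}) {
                Field field = caches.getDeclaredField(name);
                field.setAccessible(true);
                Object value = field.get(null); // static field, no instance needed
                if (value instanceof Map) {
                    ((Map<?, ?>) value).clear();
                    cleared++;
                }
            }
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Internals missing or not accessible on this JDK; report what was cleared.
        }
        return cleared;
    }
}
```

Clearing these maps affects every class in the JVM, not just the job's
classes; entries are rebuilt on the next serialization, so the cost should
mainly be some repeated reflection work.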
