Hi! As far as I remember, this was already a known issue a few years ago, and Flink currently has no solution for it (correct me if I'm wrong). I see that you're running jobs on a YARN session. As a workaround, could you switch to yarn-per-job mode? There the JM and TMs are created and destroyed for each job, so any leaked class loaders go away together with the process.
David Clutter <dclut...@yahooinc.com> wrote on Tue, Jan 4, 2022 at 23:39:

> I am seeing an issue with class loaders not being GCed and the metaspace
> eventually OOM. Here is my setup:
>
> - Flink 1.13.1 on EMR using JDK 8 in session mode
> - Job manager is a long-running yarn session
> - New jobs are submitted every 5m (and typically run for less than 5m)
>
> I find that after a few hours the job manager gets killed with Metaspace
> OOM. I tried increasing the Metaspace for the job manager but that only
> delays the OOM.
>
> I did some debugging using jcmd and I noticed that the size of the classes
> loaded is always increasing. Next I did a heap dump and found that
> instances of org.apache.flink.util.ChildFirstClassLoader are present long
> after the jobs complete. Checking the GC roots I found that there is a
> reference in java.io.ObjectStreamClass$Caches. Seems to be this JDK
> issue: https://bugs.openjdk.java.net/browse/JDK-8277072
>
> Curious if there are any workarounds for this situation?
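
The growth in loaded classes described in the quoted message can also be watched from inside a JVM through the standard java.lang.management beans, which may be handy for logging the trend between job submissions. The following is only a minimal sketch: the class name MetaspaceProbe and the snapshot() helper are made up for illustration, the pool name "Metaspace" is an assumption that holds on HotSpot JDK 8 and later, and the code reports on the JVM it runs in, so it would have to run inside the JobManager process (or be adapted to a remote JMX connection) to say anything about the session cluster.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Flink: reads class-loading counters of the current JVM.
public class MetaspaceProbe {

    /**
     * Returns three values:
     * [0] classes currently loaded,
     * [1] classes unloaded since JVM start,
     * [2] bytes used in the "Metaspace" pool, or -1 if no pool with that name exists.
     */
    static long[] snapshot() {
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();

        long metaspaceUsed = -1L;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot names the pool "Metaspace" (JDK 8+).
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    metaspaceUsed = usage.getUsed();
                }
            }
        }

        return new long[] {
            classLoading.getLoadedClassCount(),
            classLoading.getUnloadedClassCount(),
            metaspaceUsed
        };
    }

    public static void main(String[] args) {
        long[] s = snapshot();
        System.out.printf("loaded=%d unloaded=%d metaspaceUsedBytes=%d%n", s[0], s[1], s[2]);
    }
}
```

If the loaded count keeps climbing across job submissions while the unloaded count stays flat, that would be consistent with class loaders being kept alive after their jobs finish.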