Hi David,

If I understand the problem correctly, there is really nothing we can do
here. Soft references are garbage collected when memory pressure is high
and the garbage collector needs to free up more memory. The problem is
that the GC doesn't really take high memory pressure on Metaspace into
account.

I guess you might try to tweak _SoftRefLRUPolicyMSPerMB_ [1], but this
might have some other consequences. Also this behavior might be highly
dependent on the garbage collector you're using.


From the docs [1]:

-XX:SoftRefLRUPolicyMSPerMB=*time*

Sets the amount of time (in milliseconds) a softly reachable object is kept
active on the heap after the last time it was referenced. The default value
is one second of lifetime per free megabyte in the heap. The
-XX:SoftRefLRUPolicyMSPerMB option accepts integer values representing
milliseconds per one megabyte of the current heap size (for Java HotSpot
Client VM) or the maximum possible heap size (for Java HotSpot Server VM).
This difference means that the Client VM tends to flush soft references
rather than grow the heap, whereas the Server VM tends to grow the heap
rather than flush soft references. In the latter case, the value of the -Xmx
option has a significant effect on how quickly soft references are garbage
collected.

The following example shows how to set the value to 2.5 seconds:

-XX:SoftRefLRUPolicyMSPerMB=2500
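
To make the arithmetic in the quoted docs concrete, here is a rough sketch of
how the retention time scales with free heap. It is my own illustration, not
code from the JDK: the class name SoftRefRetentionEstimate and the helper
retentionMillis are made up, and the result only approximates what the
collector does, since the real policy works from heap usage measured at the
last GC.

```java
// Rough estimate only: mirrors the "milliseconds per free megabyte" rule
// from the docs, not the exact HotSpot implementation.
public class SoftRefRetentionEstimate {

    static final long BYTES_PER_MB = 1024L * 1024L;

    /** Estimated retention in ms: whole free megabytes times ms per megabyte. */
    static long retentionMillis(long freeBytes, long msPerMb) {
        return (freeBytes / BYTES_PER_MB) * msPerMb;
    }

    public static void main(String[] args) {
        // Assumed input: the value you would pass to -XX:SoftRefLRUPolicyMSPerMB
        // (1000 is the documented default, i.e. one second per free megabyte).
        long msPerMb = args.length > 0 ? Long.parseLong(args[0]) : 1000L;

        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();

        // Client VM: free space relative to the current heap size.
        long freeVsCurrentHeap = rt.totalMemory() - used;
        // Server VM: free space relative to the maximum heap size (-Xmx).
        long freeVsMaxHeap = rt.maxMemory() - used;

        System.out.println("Estimate vs. current heap (ms): "
                + retentionMillis(freeVsCurrentHeap, msPerMb));
        System.out.println("Estimate vs. max heap (ms):     "
                + retentionMillis(freeVsMaxHeap, msPerMb));
    }
}
```

With the default of 1000 and 512 MB free, the estimate is 512000 ms, so a
softly reachable object could survive for several minutes after its last use.
Lowering the value shortens that window.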



[1] https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html

Best,
D.

On Thu, Jan 6, 2022 at 3:13 AM Caizhi Weng <tsreape...@gmail.com> wrote:

> Hi!
>
> As far as I remember, this has been a known issue for a few years, but
> Flink currently has no solution for it (correct me if I'm wrong). I see
> that you're running jobs on a yarn session. Could you switch to
> yarn-per-job mode (where the JM and TMs are created and destroyed for each
> job) as a workaround?
>
> On Tue, Jan 4, 2022 at 11:39 PM David Clutter <dclut...@yahooinc.com> wrote:
>
>> I am seeing an issue where class loaders are not GCed and the Metaspace
>> eventually runs out of memory.  Here is my setup:
>>
>> - Flink 1.13.1 on EMR using JDK 8 in session mode
>> - Job manager is a long-running yarn session
>> - New jobs are submitted every 5m (and typically run for less than 5m)
>>
>> I find that after a few hours the job manager gets killed with Metaspace
>> OOM.  I tried increasing the Metaspace for the job manager but that only
>> delays the OOM.
>>
>> I did some debugging using jcmd and I noticed that the size of the
>> classes loaded is always increasing.  Next I did a heap dump and found that
>> instances of org.apache.flink.util.ChildFirstClassLoader are present
>> long after the jobs complete.  Checking the GC roots I found that there is
>> a reference in java.io.ObjectStreamClass$Caches.  Seems to be this JDK
>> issue: https://bugs.openjdk.java.net/browse/JDK-8277072
>>
>> Curious if there are any workarounds for this situation?
>>
>>
