[ https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yun Gao updated FLINK-25027:
----------------------------
    Fix Version/s: 1.16.0

> Allow GC of a finished job's JobMaster before the slot timeout is reached
> -------------------------------------------------------------------------
>
>                 Key: FLINK-25027
>                 URL: https://issues.apache.org/jira/browse/FLINK-25027
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.12.5, 1.13.3
>            Reporter: Nico Kruber
>            Assignee: Shammon
>            Priority: Major
>             Fix For: 1.15.0, 1.16.0
>
>         Attachments: image-2021-11-23-20-32-20-479.png
>
>
> In a session cluster, after a (batch) job is finished, the JobMaster seems to
> stick around for another couple of minutes before being eligible for garbage
> collection.
> Looking into a heap dump, it seems to be tied to a
> {{PhysicalSlotRequestBulkCheckerImpl}} which is enqueued in the underlying
> Akka executor (and keeps the JM from being GC’d). Per default the action is
> scheduled for {{slot.request.timeout}} that defaults to 5 min (thanks
> [~trohrmann] for helping out here)
> !image-2021-11-23-20-32-20-479.png!
> With this setting, you will have to account for enough metaspace to cover 5
> minutes of time which may span a couple of jobs, needlessly!
> The problem seems to be that Flink is using the main thread executor for the
> scheduling that uses the {{ActorSystem}}'s scheduler and the future task
> scheduled with Akka can (probably) not be easily cancelled.
> One idea could be to use a dedicated thread pool per JM, that we shut down
> when the JM terminates. That way we would not keep the JM from being GC’d.
> (The concrete example we investigated was a DataSet job)
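
The per-JM thread pool idea from the description can be sketched roughly as follows. This is a minimal illustration in plain Java, not Flink code: the class PerJobScheduler and its methods are hypothetical names. It assumes only that the timeout check is submitted to a java.util.concurrent.ScheduledExecutorService owned by the job rather than to a shared, long-lived scheduler. Calling shutdownNow() drains the still-pending delayed task from the executor's queue, so the executor no longer references whatever that task captured.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical per-job scheduler (illustration only, not a Flink class). */
final class PerJobScheduler implements AutoCloseable {

    // One scheduler per job, so its lifetime can end together with the job.
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "per-job-scheduler");
                thread.setDaemon(true); // do not keep the JVM alive for this thread
                return thread;
            });

    /** Schedules a delayed check, e.g. a slot request timeout check. */
    ScheduledFuture<?> scheduleTimeoutCheck(Runnable check, long delay, TimeUnit unit) {
        return executor.schedule(check, delay, unit);
    }

    /**
     * Stops the scheduler and drops every task that has not started yet.
     * Returns the number of dropped tasks.
     */
    int shutdown() {
        List<Runnable> dropped = executor.shutdownNow();
        return dropped.size();
    }

    @Override
    public void close() {
        shutdown();
    }
}
```

Whether this maps cleanly onto the JobMaster's main thread executor is a separate question. The sketch only shows why a scheduler with the same lifetime as the job would not hold on to the job's objects until {{slot.request.timeout}} expires.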