[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2021-11-23 Thread FYung (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448365#comment-17448365
 ] 

FYung commented on FLINK-25027:
---

Hi [~nkruber],
Thank you for creating this issue. I'm glad to see it because we have the same 
problem with the Akka thread pool.

We submit batch jobs to a Flink session cluster and find there are too many 
tasks in the Akka thread pool, which causes GC pressure. Besides 
PhysicalSlotRequestBulkCheckerImpl, we also found heartbeat tasks, the timeout 
checkers for pending requests (`checkIdleSlotTimeout`/`checkBatchSlotTimeout`) 
in DeclarativeSlotPoolBridge, the timeout checker in 
`DefaultScheduler.registerProducedPartitions`, and more. The Akka thread pool 
holds on to these instances even after the jobs are finished.

Using a dedicated thread pool per JM to manage these tasks is a good idea. 
Maybe we should first create a thread pool in the JM, then create subtasks 
under this issue to move the tasks above from the Akka thread pool to the JM's 
thread pool. What do you think? Hope to hear from you [~nkruber] [~trohrmann], 
thanks :)

> Allow GC of a finished job's JobMaster before the slot timeout is reached
> -
>
> Key: FLINK-25027
> URL: https://issues.apache.org/jira/browse/FLINK-25027
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.14.0, 1.12.5, 1.13.3
>Reporter: Nico Kruber
>Priority: Major
> Attachments: image-2021-11-23-20-32-20-479.png
>
>
> In a session cluster, after a (batch) job is finished, the JobMaster seems to 
> stick around for another couple of minutes before being eligible for garbage 
> collection.
> Looking into a heap dump, it seems to be tied to a 
> {{PhysicalSlotRequestBulkCheckerImpl}} which is enqueued in the underlying 
> Akka executor (and keeps the JM from being GC’d). By default, the action is 
> scheduled for {{slot.request.timeout}}, which defaults to 5 min (thanks 
> [~trohrmann] for helping out here).
> !image-2021-11-23-20-32-20-479.png!
> With this setting, you have to account for enough metaspace to cover 5 
> minutes of time, which may needlessly span a couple of jobs!
> The problem seems to be that Flink uses the main thread executor for the 
> scheduling, which relies on the {{ActorSystem}}'s scheduler, and a future task 
> scheduled with Akka can (probably) not be easily cancelled.
> One idea could be to use a dedicated thread pool per JM, that we shut down 
> when the JM terminates. That way we would not keep the JM from being GC’d.
> (The concrete example we investigated was a DataSet job)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2021-11-24 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448436#comment-17448436
 ] 

Till Rohrmann commented on FLINK-25027:
---

[~zjureel] I think this is a good idea. We should avoid using the main thread 
executor for scheduling tasks that have a long delay. Do you want to work on 
this issue?



[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2021-11-24 Thread Shammon (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448464#comment-17448464
 ] 

Shammon commented on FLINK-25027:
-

[~trohrmann] Yes. Can you assign the issue to me? I'll be glad to work on it. 
THX



[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2021-11-25 Thread Martijn Visser (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449052#comment-17449052
 ] 

Martijn Visser commented on FLINK-25027:


[~zjureel] I've assigned it to you



[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2021-11-25 Thread Shammon (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449062#comment-17449062
 ] 

Shammon commented on FLINK-25027:
-

[~MartijnVisser] Thanks and I will work on it soon :)



[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2021-11-29 Thread Yangze Guo (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450246#comment-17450246
 ] 

Yangze Guo commented on FLINK-25027:


Hi [~zjureel], thanks for your proposal. I think this is an important 
improvement for users who run batch jobs with Flink. I'd just like to share my 
two cents on the proposal:
- Compared to just adding a thread pool to the JM, how about generalizing this 
mechanism in `RpcEndpoint`? That way, other components like the TM and RM can 
also leverage it for scheduling periodic tasks.
- Your proposal mitigates the waste of the JM's metaspace. However, full GCs 
caused by those periodic tasks can also harm performance in the batch 
scenario: the periodic tasks are likely to be promoted to the old generation 
before being executed. If possible, I think we'd better have a unified 
solution for periodic tasks that also mitigates such promotion.



[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2021-11-29 Thread Shammon (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450409#comment-17450409
 ] 

Shammon commented on FLINK-25027:
-

[~guoyangze] Thank you for the advice; a unified solution for the periodic 
tasks sounds good to me. There are periodic tasks in the ResourceManager and 
TaskManager too, scheduled by a timer or the Akka thread pool. I will add a 
general scheduled thread pool to RpcEndpoint and schedule these periodic tasks 
on it, thanks.
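A minimal sketch of such a per-endpoint scheduled pool could look like the following. This is a hypothetical illustration, not the actual Flink `RpcEndpoint` API: the class name, constructor, and accessors are invented for the example. The key points are enabling remove-on-cancel so cancelled tasks leave the queue immediately, and draining the queue on shutdown so pending tasks no longer pin the terminated endpoint.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch: a scheduled executor owned by a single endpoint and
 * torn down when the endpoint terminates, so long-delay periodic tasks no
 * longer keep the endpoint reachable after it finishes.
 */
public class EndpointScheduler implements AutoCloseable {

    private final ScheduledThreadPoolExecutor scheduler;

    public EndpointScheduler(String endpointName) {
        this.scheduler = new ScheduledThreadPoolExecutor(1, runnable -> {
            Thread t = new Thread(runnable, endpointName + "-scheduler");
            t.setDaemon(true);
            return t;
        });
        // Drop cancelled tasks from the queue immediately instead of letting
        // them linger until their trigger time.
        scheduler.setRemoveOnCancelPolicy(true);
    }

    public ScheduledFuture<?> schedule(Runnable task, long delay, TimeUnit unit) {
        return scheduler.schedule(task, delay, unit);
    }

    /** Number of tasks still waiting in the queue (for illustration). */
    public int queuedTasks() {
        return scheduler.getQueue().size();
    }

    @Override
    public void close() {
        // shutdownNow() drains the queue, releasing every reference the
        // pending tasks hold, so the owning endpoint becomes GC-eligible
        // right away instead of after the longest scheduled delay.
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        EndpointScheduler s = new EndpointScheduler("jobmaster-demo");
        s.schedule(() -> {}, 5, TimeUnit.MINUTES);
        System.out.println("queued before close: " + s.queuedTasks());
        s.close();
        System.out.println("queued after close:  " + s.queuedTasks());
    }
}
```

With this design, the endpoint's lifecycle bounds the lifetime of every task it schedules, which is exactly the property the shared Akka scheduler lacks.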



[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2021-11-30 Thread Yangze Guo (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450956#comment-17450956
 ] 

Yangze Guo commented on FLINK-25027:


[~zjureel] Great, would you like to edit the description of the subtasks?

> However, full GCs caused by those periodic tasks can also harm the 
> performance in the batch scenario. Those periodic tasks are likely to be 
> promoted to the old generation before being executed.

Besides, WDYT about this issue?



[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2021-12-02 Thread Shammon (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452693#comment-17452693
 ] 

Shammon commented on FLINK-25027:
-

[~guoyangze] I think that's a good question. The main problem with full GC in 
the JM is that it increases the latency of small batch jobs in a session 
cluster. As you mentioned, too many periodic tasks that can't be recycled in 
time will lead to full GCs in the JM. When I submit small batch jobs to a 
Flink session cluster as OLAP queries, the full GCs in the JM increase the 
latency and decrease the QPS of these jobs, so I think it's a very important 
issue.



[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2022-01-11 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472541#comment-17472541
 ] 

Matthias Pohl commented on FLINK-25027:
---

Hi [~zjureel], is there an update on the progress of this task? I'm interested 
because the {{scheduledExecutor}} feature would help us with the retry 
mechanism we're planning to integrate in FLINK-25433.



[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2022-01-11 Thread Yangze Guo (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472559#comment-17472559
 ] 

Yangze Guo commented on FLINK-25027:


Hi [~mapohl], FLINK-25085 is in progress and I think it will be merged this 
week.



[jira] [Commented] (FLINK-25027) Allow GC of a finished job's JobMaster before the slot timeout is reached

2022-01-12 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474544#comment-17474544
 ] 

Matthias Pohl commented on FLINK-25027:
---

Awesome, thanks for the pointer (y)
