[jira] [Commented] (FLINK-25277) Introduce explicit shutdown signalling between TaskManager and JobManager

Niklas Semmler (Jira) Mon, 24 Jan 2022 14:05:04 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-25277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481432#comment-17481432
 ]


Niklas Semmler commented on FLINK-25277:
----------------------------------------

[~chesnay] Yes, you are right. [~trohrmann] needed the shutdown hook for a 
different use case, so he included the code already in 
dd6069fabf8a7ff65fbd9ff8dd7b0c47f492288f. When I saw this, I removed it from 
the commits above to avoid merge conflicts.

Also, I just want to stress that the shutdown code was really just the icing 
on the cake. All the signaling functionality was already implemented; it was 
just not called during shutdown.

> Introduce explicit shutdown signalling between TaskManager and JobManager 
> --------------------------------------------------------------------------
>
>                 Key: FLINK-25277
>                 URL: https://issues.apache.org/jira/browse/FLINK-25277
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0, 1.14.0
>            Reporter: Niklas Semmler
>            Assignee: Niklas Semmler
>            Priority: Major
>              Labels: pull-request-available, reactive
>             Fix For: 1.15.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> We need to introduce shutdown signalling between TaskManager and JobManager 
> for fast & graceful shutdown in reactive scheduler mode.
> In Flink 1.14 and earlier versions, the JobManager tracks the availability of 
> a TaskManager using a heartbeat. This heartbeat is by default configured with 
> an interval of 10 seconds and a timeout of 50 seconds [1]. Hence, the 
> shutdown of a TaskManager is recognized only after about 50-60 seconds. This 
> works fine for the static scheduling mode, where a TaskManager only 
> disappears as part of a cluster shutdown or a job failure. However, in the 
> reactive scheduler mode (FLINK-10407), TaskManagers are regularly added to and 
> removed from a running job. Here, the heartbeat mechanism incurs additional 
> delays.
> To remove these delays, we add an explicit shutdown signal from the 
> TaskManager to the JobManager.
>  
> [1]https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout
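
To illustrate the difference described above, here is a minimal sketch of 
heartbeat-based detection versus an explicit shutdown signal. It is not 
Flink's actual implementation: the names ShutdownSignalSketch, 
TaskManagerTracker, register, heartbeat, disconnect and isConsideredAlive are 
hypothetical and exist only for this example. Time is passed in as plain 
millisecond values so that the behaviour is deterministic. The 50 second 
timeout is the default mentioned in the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; none of these types exist in Flink. */
final class ShutdownSignalSketch {

    /** Tracks the last heartbeat per TaskManager using caller-supplied timestamps (ms). */
    static final class TaskManagerTracker {
        private final long heartbeatTimeoutMs;
        private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

        TaskManagerTracker(long heartbeatTimeoutMs) {
            this.heartbeatTimeoutMs = heartbeatTimeoutMs;
        }

        /** A new TaskManager joins; registration counts as its first heartbeat. */
        void register(String taskManagerId, long nowMs) {
            lastHeartbeatMs.put(taskManagerId, nowMs);
        }

        /** Records a heartbeat; unknown TaskManagers are ignored. */
        void heartbeat(String taskManagerId, long nowMs) {
            lastHeartbeatMs.computeIfPresent(taskManagerId, (id, last) -> nowMs);
        }

        /** Explicit shutdown signal: the TaskManager is forgotten immediately. */
        void disconnect(String taskManagerId) {
            lastHeartbeatMs.remove(taskManagerId);
        }

        /** Alive means: known, and the last heartbeat is younger than the timeout. */
        boolean isConsideredAlive(String taskManagerId, long nowMs) {
            Long last = lastHeartbeatMs.get(taskManagerId);
            return last != null && nowMs - last < heartbeatTimeoutMs;
        }
    }

    public static void main(String[] args) {
        TaskManagerTracker tracker = new TaskManagerTracker(50_000L);
        tracker.register("tm-1", 0L);
        tracker.register("tm-2", 0L);

        // tm-1 stops silently right after registering; tm-2 signals its shutdown.
        tracker.disconnect("tm-2");

        // Ten seconds later only tm-2 is known to be gone.
        System.out.println("tm-1 alive at 10s: " + tracker.isConsideredAlive("tm-1", 10_000L));
        System.out.println("tm-2 alive at 10s: " + tracker.isConsideredAlive("tm-2", 10_000L));

        // tm-1 is only recognized as gone once the timeout has elapsed.
        System.out.println("tm-1 alive at 50s: " + tracker.isConsideredAlive("tm-1", 50_000L));
    }
}
```

In this sketch the silently stopped TaskManager still counts as alive until 
the timeout expires, while the one that sent the explicit signal is removed 
at once. That gap is the delay the issue aims to eliminate.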



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
