[ https://issues.apache.org/jira/browse/FLINK-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327372#comment-15327372 ]
ASF GitHub Bot commented on FLINK-3800:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2096

    [FLINK-3800] [runtime] Introduce SUSPENDED job status

    The SUSPENDED job status is a new ExecutionGraph state which can be reached
    from all non-terminal states by calling suspend on the ExecutionGraph. Unlike
    the FAILED, FINISHED and CANCELED states, the SUSPENDED state does not trigger
    the deletion of the job from the HA storage. Therefore, this state can be used
    to handle the loss of leadership or the shutdown of a JobManager so that the
    ExecutionGraph is stopped but can still be recovered.

    SUSPENDED is also a terminal state, but it is only locally terminal, whereas
    FAILED, CANCELED and FINISHED are globally terminal states.

    Add test case for suspend signal
    Add test case for suspending restarting job
    Add test case for HA job recovery when losing leadership
    Add online documentation for the job status

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixHALifecycle

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2096.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2096

----
commit 0d3c738e85fed2e161bc724887a4d8ce06a2798c
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2016-06-09T09:37:14Z

    [FLINK-3800] [runtime] Introduce SUSPENDED job status

    The SUSPENDED job status is a new ExecutionGraph state which can be reached
    from all non-terminal states by calling suspend on the ExecutionGraph. Unlike
    the FAILED, FINISHED and CANCELED states, the SUSPENDED state does not trigger
    the deletion of the job from the HA storage. Therefore, this state can be used
    to handle the loss of leadership or the shutdown of a JobManager so that the
    ExecutionGraph is stopped but can still be recovered.

    SUSPENDED is also a terminal state, but it is only locally terminal, whereas
    FAILED, CANCELED and FINISHED are globally terminal states.

    Add test case for suspend signal
    Add test case for suspending restarting job
    Add test case for HA job recovery when losing leadership
    Add online documentation for the job status

----

> ExecutionGraphs can become orphans
> ----------------------------------
>
>                 Key: FLINK-3800
>                 URL: https://issues.apache.org/jira/browse/FLINK-3800
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> The {{JobManager.cancelAndClearEverything}} method fails all currently
> executing jobs on the {{JobManager}} and then clears the list of
> {{currentJobs}} kept in the JobManager. This can become problematic if the
> user has set a restart strategy for a job, because the {{RestartStrategy}}
> will try to restart the job. This can lead to unwanted re-deployments of the
> job, which consume resources and thus disturb the execution of other
> jobs. If the restart strategy never stops, then this prevents the
> {{ExecutionGraph}} from ever being properly terminated.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
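
For readers following the pull request description, the distinction between locally and globally terminal states might be sketched roughly as below. This is only an illustration and not the code from the patch: the class `JobStatusSketch`, the `Terminality` enum, the helpers `suspend` and `onTerminalStatus`, the `Set<String>` standing in for the HA storage, and the non-terminal status names are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative model of the job statuses described above; not Flink's actual classes. */
class JobStatusSketch {

    /** Hypothetical classification of how final a status is. */
    enum Terminality { NON_TERMINAL, LOCALLY_TERMINAL, GLOBALLY_TERMINAL }

    enum JobStatus {
        // Non-terminal statuses (names assumed for illustration).
        CREATED(Terminality.NON_TERMINAL),
        RUNNING(Terminality.NON_TERMINAL),
        RESTARTING(Terminality.NON_TERMINAL),
        // Globally terminal: the job is done for good.
        FAILED(Terminality.GLOBALLY_TERMINAL),
        CANCELED(Terminality.GLOBALLY_TERMINAL),
        FINISHED(Terminality.GLOBALLY_TERMINAL),
        // Locally terminal: stopped on this JobManager, but still recoverable.
        SUSPENDED(Terminality.LOCALLY_TERMINAL);

        private final Terminality terminality;

        JobStatus(Terminality terminality) {
            this.terminality = terminality;
        }

        boolean isTerminal() {
            return terminality != Terminality.NON_TERMINAL;
        }

        boolean isGloballyTerminal() {
            return terminality == Terminality.GLOBALLY_TERMINAL;
        }
    }

    /** Suspending only has an effect when the current status is non-terminal. */
    static JobStatus suspend(JobStatus current) {
        return current.isTerminal() ? current : JobStatus.SUSPENDED;
    }

    /** Removes the job from the (stand-in) HA store only for globally terminal statuses. */
    static void onTerminalStatus(String jobId, JobStatus status, Set<String> haStore) {
        if (status.isGloballyTerminal()) {
            haStore.remove(jobId);
        }
    }

    public static void main(String[] args) {
        Set<String> haStore = new HashSet<>();
        haStore.add("job-1");

        // Losing leadership while the job is restarting: suspend, keep it recoverable.
        JobStatus status = suspend(JobStatus.RESTARTING);
        onTerminalStatus("job-1", status, haStore);
        System.out.println(status + ", recoverable: " + haStore.contains("job-1"));

        // A globally terminal status removes the job from the store.
        onTerminalStatus("job-1", JobStatus.FAILED, haStore);
        System.out.println("FAILED, recoverable: " + haStore.contains("job-1"));
    }
}
```

In this sketch a suspended job stays in the store, so a new leader could pick it up again, while a failed, canceled or finished job is removed.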