[ https://issues.apache.org/jira/browse/FLINK-3800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327372#comment-15327372 ]
ASF GitHub Bot commented on FLINK-3800:
---------------------------------------
GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/2096
[FLINK-3800] [runtime] Introduce SUSPENDED job status
The SUSPENDED job status is a new ExecutionGraph state that can be reached from all non-terminal states by calling suspend on the ExecutionGraph. Unlike the FAILED, FINISHED and CANCELED states, the SUSPENDED state does not trigger the deletion of the job from the HA storage. Therefore, this state can be used to handle the loss of leadership or the shutdown of a JobManager: the ExecutionGraph is stopped but can still be recovered. SUSPENDED is also a terminal state, but it is only locally terminal, which distinguishes it from FAILED, CANCELED and FINISHED, which are globally terminal states.
Add test case for suspend signal
Add test case for suspending restarting job
Add test case for HA job recovery when losing leadership
Add online documentation for the job status
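The distinction between locally and globally terminal states described above can be sketched in Java as follows. This is a simplified illustration, not Flink's actual `JobStatus` class: the `TerminalState` enum and the helpers `shouldRemoveFromHaStorage` and `canSuspend` are hypothetical names chosen for this example, and the status list is reduced to the states relevant here.

```java
public class JobStatusSketch {

    /** How final a status is: not final, final on this JobManager only, or final everywhere. */
    enum TerminalState { NON_TERMINAL, LOCALLY, GLOBALLY }

    /** Reduced set of job statuses; SUSPENDED is the locally terminal one. */
    enum JobStatus {
        CREATED(TerminalState.NON_TERMINAL),
        RUNNING(TerminalState.NON_TERMINAL),
        FAILING(TerminalState.NON_TERMINAL),
        CANCELLING(TerminalState.NON_TERMINAL),
        RESTARTING(TerminalState.NON_TERMINAL),
        FAILED(TerminalState.GLOBALLY),
        CANCELED(TerminalState.GLOBALLY),
        FINISHED(TerminalState.GLOBALLY),
        SUSPENDED(TerminalState.LOCALLY);

        private final TerminalState terminalState;

        JobStatus(TerminalState terminalState) {
            this.terminalState = terminalState;
        }

        /** True for every terminal status, SUSPENDED included. */
        boolean isTerminal() {
            return terminalState != TerminalState.NON_TERMINAL;
        }

        /** True only for FAILED, CANCELED and FINISHED. */
        boolean isGloballyTerminal() {
            return terminalState == TerminalState.GLOBALLY;
        }
    }

    /** Hypothetical helper: only globally terminal jobs are deleted from the HA storage. */
    static boolean shouldRemoveFromHaStorage(JobStatus status) {
        return status.isGloballyTerminal();
    }

    /** Hypothetical helper: suspend is accepted from every non-terminal status. */
    static boolean canSuspend(JobStatus status) {
        return !status.isTerminal();
    }

    public static void main(String[] args) {
        // Print, for each status, whether it is terminal, whether it would clear HA storage,
        // and whether suspend may be called from it.
        for (JobStatus s : JobStatus.values()) {
            System.out.printf("%-10s terminal=%-5b removeFromHa=%-5b canSuspend=%b%n",
                    s, s.isTerminal(), shouldRemoveFromHaStorage(s), canSuspend(s));
        }
    }
}
```

Keeping a single "is terminal" check alongside a narrower "is globally terminal" check lets code that only needs to know "has this graph stopped?" treat SUSPENDED like the other final states, while cleanup of HA storage is restricted to the globally terminal ones.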
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink fixHALifecycle
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/2096.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2096
----
commit 0d3c738e85fed2e161bc724887a4d8ce06a2798c
Author: Till Rohrmann
Date: 2016-06-09T09:37:14Z
[FLINK-3800] [runtime] Introduce SUSPENDED job status
----
> ExecutionGraphs can become orphans
> ----------------------------------
>
> Key: FLINK-3800
> URL: https://issues.apache.org/jira/browse/FLINK-3800
> Project: Flink
> Issue Type: Bug
> Components: JobManager
> Affects Versions: 1.0.0, 1.1.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
>
> The {{JobManager.cancelAndClearEverything}} method fails all jobs currently
> running on the {{JobManager}} and then clears the list of
> {{currentJobs}} kept in the JobManager. This can become problematic if the
> user has set a restart strategy for a job, because the {{RestartStrategy}}
> will try to restart the job. This can lead to unwanted redeployments of the
> job, which consume resources and thus interfere with the execution of other
> jobs. If the restart strategy never gives up, the {{ExecutionGraph}} is
> never properly terminated.
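The orphaning problem described in the issue, and why a suspend-style shutdown avoids it, can be illustrated with a toy model. Everything here is hypothetical: `ToyExecutionGraph`, the one-method `ToyRestartStrategy` interface and the map standing in for `currentJobs` are simplifications made for this example; the real `ExecutionGraph` has more states and performs its transitions asynchronously.

```java
import java.util.HashMap;
import java.util.Map;

public class OrphanSketch {

    enum Status { RUNNING, FAILED, SUSPENDED }

    /** Hypothetical one-method stand-in for a restart strategy. */
    interface ToyRestartStrategy {
        boolean canRestart();
    }

    /** Toy execution graph; the real ExecutionGraph has more states and acts asynchronously. */
    static class ToyExecutionGraph {
        private final ToyRestartStrategy restartStrategy;
        private Status status = Status.RUNNING;
        private int restarts = 0;

        ToyExecutionGraph(ToyRestartStrategy restartStrategy) {
            this.restartStrategy = restartStrategy;
        }

        private boolean isTerminal() {
            return status == Status.FAILED || status == Status.SUSPENDED;
        }

        /** Failing consults the restart strategy, which may redeploy the job. */
        void fail() {
            if (isTerminal()) {
                return;
            }
            if (restartStrategy.canRestart()) {
                restarts++;
                status = Status.RUNNING; // redeployed: the job keeps using resources
            } else {
                status = Status.FAILED;
            }
        }

        /** Suspending bypasses the restart strategy entirely. */
        void suspend() {
            if (!isTerminal()) {
                status = Status.SUSPENDED;
            }
        }

        Status status() { return status; }

        int restarts() { return restarts; }
    }

    public static void main(String[] args) {
        ToyRestartStrategy alwaysRestart = () -> true; // a strategy that never gives up
        Map<String, ToyExecutionGraph> currentJobs = new HashMap<>();

        // Old shutdown path: fail every job, then forget about it.
        ToyExecutionGraph orphan = new ToyExecutionGraph(alwaysRestart);
        currentJobs.put("job-1", orphan);
        currentJobs.values().forEach(ToyExecutionGraph::fail);
        currentJobs.clear();
        System.out.println("after fail:    " + orphan.status() + ", restarts=" + orphan.restarts());

        // Suspend-based shutdown path: the graph stops and is not restarted.
        ToyExecutionGraph stopped = new ToyExecutionGraph(alwaysRestart);
        currentJobs.put("job-2", stopped);
        currentJobs.values().forEach(ToyExecutionGraph::suspend);
        currentJobs.clear();
        System.out.println("after suspend: " + stopped.status() + ", restarts=" + stopped.restarts());
    }
}
```

In this model the failed job is removed from `currentJobs` yet is running again after the restart, which is exactly an orphaned graph. The suspended job stops and, because SUSPENDED is only locally terminal, its entry in HA storage would remain available for recovery.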
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)