[
https://issues.apache.org/jira/browse/FLINK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582735#comment-15582735
]
ASF GitHub Bot commented on FLINK-4715:
---------------------------------------
GitHub user uce opened a pull request:
https://github.com/apache/flink/pull/2652
[FLINK-4715] Fail TaskManager with fatal error if task cancellation is stuck
- Splits the cancellation up into two threads:
* The `TaskCanceler` calls `cancel` on the invokable and `interrupt` on
the executing Thread. It then exits.
* The `TaskCancellationWatchDog` kicks in after the task cancellation
interval (current default: 30 secs) and periodically calls `interrupt` on the
executing Thread. If the Thread does not terminate within the task cancellation
timeout (new config value, default 3 mins), the task manager is notified about
a fatal error, leading to termination of the JVM.
- The new configuration is exposed via
`ConfigConstants.TASK_CANCELLATION_TIMEOUT_MILLIS`
(default: 3 mins) and the `ExecutionConfig` (similar to the cancellation
interval).
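For readers who want a picture of the two-thread split, here is a minimal
standalone sketch. It is not the code from the pull request: the class
`CancellationSketch`, the interfaces `Cancelable` and `FatalErrorHandler`, and
the method names are hypothetical stand-ins, and it assumes the timeout is
measured from the moment the watchdog is started.

```java
/** Illustrative sketch only; all names here are hypothetical, not Flink's classes. */
final class CancellationSketch {

    /** Hypothetical stand-in for the invokable's cancel hook. */
    interface Cancelable {
        void cancel() throws Exception;
    }

    /** Hypothetical stand-in for notifying the task manager about a fatal error. */
    interface FatalErrorHandler {
        void onFatalError(String message);
    }

    /** Canceler: calls cancel once, interrupts the executing thread once, then exits. */
    static Runnable canceler(Cancelable invokable, Thread executer) {
        return () -> {
            try {
                invokable.cancel();
            } catch (Exception e) {
                // A failing cancel() must not prevent the interrupt below.
            }
            executer.interrupt();
        };
    }

    /**
     * Watchdog: waits one interval, then interrupts the executing thread once per
     * interval. If the thread is still alive when the timeout has elapsed, the
     * handler is notified (in the real system this leads to JVM termination).
     */
    static Runnable watchDog(Thread executer, long intervalMillis,
                             long timeoutMillis, FatalErrorHandler handler) {
        return () -> {
            // join(0) would wait forever, so never wait less than 1 ms.
            long interval = Math.max(1L, intervalMillis);
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            try {
                executer.join(interval); // give the canceler's interrupt time to work
                while (executer.isAlive()) {
                    long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remainingMillis <= 0) {
                        handler.onFatalError("Task did not terminate within "
                                + timeoutMillis + " ms of cancellation");
                        return;
                    }
                    executer.interrupt();
                    executer.join(Math.max(1L, Math.min(interval, remainingMillis)));
                }
            } catch (InterruptedException e) {
                // The watchdog itself was interrupted: restore the flag and stop.
                Thread.currentThread().interrupt();
            }
        };
    }
}
```

In this sketch the canceler and the watchdog would each run on their own
thread, so a `cancel` call that blocks in user code cannot delay the repeated
interrupts or the fatal error notification.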
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/uce/flink 4715-suicide
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/2652.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2652
----
> TaskManager should commit suicide after cancellation failure
> ------------------------------------------------------------
>
> Key: FLINK-4715
> URL: https://issues.apache.org/jira/browse/FLINK-4715
> Project: Flink
> Issue Type: Improvement
> Components: TaskManager
> Affects Versions: 1.2.0
> Reporter: Till Rohrmann
> Assignee: Ufuk Celebi
> Fix For: 1.2.0
>
>
> In case of a failed cancellation, e.g. the task cannot be cancelled after a
> given time, the {{TaskManager}} should kill itself. That way we guarantee
> that there is no resource leak.
> This behaviour acts as a safety-net against faulty user code.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)