[
https://issues.apache.org/jira/browse/FLINK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582048#comment-15582048
]
Stephan Ewen commented on FLINK-4715:
-------------------------------------
I think we should do the following:
Split the cancellation up into two threads:
# The first thread calls {{cancel()}} on the task and {{interrupt()}} on the
main thread. It then exits.
# The second thread is a watchdog that kicks in after {{n}} seconds (default
is 10, I think) and then calls {{interrupt()}} again every {{n}} seconds.
After a maximum duration (let's say 1 minute) it notifies the {{TaskManager}} of
a fatal error. In most setups, this leads to a process kill.
The reason to separate this into two threads is that we have seen cases where
{{cancel()}} blocks waiting on a lock held by the main thread. With a single
cancellation thread, the {{interrupt()}} call would never happen in that case,
and the "task manager exit" safety net would never kick in either.
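A rough sketch of the two-thread split, to make the proposal concrete. This is not Flink code: {{Cancelable}}, {{FatalErrorHandler}}, {{cancelTask}} and {{runDemo}} are hypothetical names made up for illustration, and the interval and timeout are shortened to milliseconds so the demo finishes quickly.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TwoThreadCancellation {

    /** Hypothetical stand-in for the task's cancel() hook. */
    interface Cancelable { void cancel() throws Exception; }

    /** Hypothetical stand-in for notifying the TaskManager of a fatal error. */
    interface FatalErrorHandler { void onFatalError(String message); }

    /** Starts the canceler and the watchdog; returns the watchdog so callers can join it. */
    static Thread cancelTask(Cancelable task, Thread mainThread,
                             long intervalMillis, long timeoutMillis,
                             FatalErrorHandler handler) {
        // Thread 1: cancel(), then interrupt(), then exit. It may block inside cancel().
        Thread canceler = new Thread(() -> {
            try {
                task.cancel();
            } catch (Throwable t) {
                // A failing cancel() must not prevent the interrupt below.
            }
            mainThread.interrupt();
        }, "task-canceler");

        // Thread 2: independent watchdog, unaffected by a blocked cancel().
        Thread watchdog = new Thread(() -> {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            try {
                while (mainThread.isAlive()) {
                    long remaining = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remaining <= 0) {
                        handler.onFatalError("Task did not stop within " + timeoutMillis + " ms");
                        return;
                    }
                    // Wait up to one interval for the main thread to finish, then interrupt again.
                    mainThread.join(Math.min(intervalMillis, remaining));
                    if (mainThread.isAlive()) {
                        mainThread.interrupt();
                    }
                }
            } catch (InterruptedException e) {
                // The watchdog itself was interrupted; give up quietly.
            }
        }, "cancel-watchdog");

        canceler.setDaemon(true);
        watchdog.setDaemon(true);
        canceler.start();
        watchdog.start();
        return watchdog;
    }

    /** Demo: returns true if the watchdog reported a fatal error. */
    static boolean runDemo(boolean cooperative) {
        AtomicBoolean fatal = new AtomicBoolean(false);
        AtomicBoolean release = new AtomicBoolean(false);
        Thread main = new Thread(() -> {
            while (!release.get()) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    if (cooperative) {
                        return; // well-behaved task stops on interrupt
                    }
                    // faulty task swallows the interrupt and keeps running
                }
            }
        }, "task-main");
        main.setDaemon(true);
        main.start();
        Thread watchdog = cancelTask(() -> {}, main, 50, 300, msg -> fatal.set(true));
        try {
            watchdog.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        release.set(true); // let the faulty task end so the demo exits cleanly
        return fatal.get();
    }

    public static void main(String[] args) {
        System.out.println("cooperative task -> fatal error: " + runDemo(true));
        System.out.println("faulty task      -> fatal error: " + runDemo(false));
    }
}
```

Because the watchdog never touches {{cancel()}}, it keeps interrupting and eventually reports the fatal error even if the first thread is stuck on a lock. In the sketch the fatal error handler only sets a flag; what the real handler does (for example, killing the process) is left open here.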
> TaskManager should commit suicide after cancellation failure
> ------------------------------------------------------------
>
> Key: FLINK-4715
> URL: https://issues.apache.org/jira/browse/FLINK-4715
> Project: Flink
> Issue Type: Improvement
> Components: TaskManager
> Affects Versions: 1.2.0
> Reporter: Till Rohrmann
> Fix For: 1.2.0
>
>
> In case of a failed cancellation, e.g. when the task cannot be cancelled within
> a given time, the {{TaskManager}} should kill itself. That way we guarantee
> that there is no resource leak.
> This behaviour acts as a safety-net against faulty user code.