[jira] [Commented] (FLINK-4715) TaskManager should commit suicide after cancellation failure

Stephan Ewen (JIRA) Mon, 17 Oct 2016 05:13:36 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582048#comment-15582048
 ]


Stephan Ewen commented on FLINK-4715:
-------------------------------------

I think we should do the following:

Split the cancellation into two threads:
  # The first thread calls {{cancel()}} on the task and {{interrupt()}} on the 
task's main thread. It then exits.
  # The second thread is a watchdog that kicks in after {{n}} seconds (default 
is 10, I think) and then calls {{interrupt()}} again every {{n}} seconds. 
After a maximum duration (let's say 1 minute) it notifies the {{TaskManager}} of 
a fatal error. In most setups, this leads to a process kill.

The reason to separate this into two threads is that we have seen cases where 
{{cancel()}} blocks waiting on a lock held by the main thread. With a single 
cancellation thread, that thread would be stuck inside {{cancel()}}, so 
neither the {{interrupt()}} call nor the "task manager exit" safety net 
would ever kick in.
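
A rough sketch of the two-thread split, to make the proposal concrete. All names here ({{CancellationSketch}}, {{Cancellable}}, {{FatalErrorHandler}}, {{cancelWithWatchdog}}) are made up for illustration and are not existing Flink classes; the interval and timeout are plain parameters rather than real config keys.

```java
/** Illustrative sketch only; not Flink's actual cancellation code. */
final class CancellationSketch {

    /** Hypothetical stand-in for the task being cancelled. */
    interface Cancellable {
        void cancel() throws Exception;
    }

    /** Hypothetical stand-in for the TaskManager's fatal-error notification. */
    interface FatalErrorHandler {
        void onFatalError(String message);
    }

    /**
     * Starts a canceler thread and a separate watchdog thread.
     * Returns the watchdog thread so callers can wait for it.
     */
    static Thread cancelWithWatchdog(
            Cancellable task,
            Thread executer,
            long intervalMillis,
            long timeoutMillis,
            FatalErrorHandler handler) {

        // Thread 1: cancel() and interrupt() once, then exit.
        // If cancel() blocks on a lock, only this thread is stuck.
        Thread canceler = new Thread(() -> {
            try {
                task.cancel();
            } catch (Throwable t) {
                // A failing cancel() must not prevent the interrupt.
            }
            executer.interrupt();
        }, "task-canceler");
        canceler.setDaemon(true);
        canceler.start();

        // Thread 2: watchdog, independent of whether cancel() ever returns.
        Thread watchdog = new Thread(() -> {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            try {
                while (executer.isAlive()) {
                    long remaining = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remaining <= 0) {
                        // Maximum duration exceeded: escalate to a fatal error.
                        handler.onFatalError(
                            "Task did not stop within " + timeoutMillis + " ms");
                        return;
                    }
                    // Wait up to one interval for the task thread to finish.
                    executer.join(Math.max(1L, Math.min(intervalMillis, remaining)));
                    if (executer.isAlive()) {
                        executer.interrupt();
                    }
                }
            } catch (InterruptedException e) {
                // Watchdog itself was interrupted; stop watching.
            }
        }, "cancellation-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
        return watchdog;
    }
}
```

With the defaults mentioned above this would be called roughly as {{cancelWithWatchdog(task, taskThread, 10_000, 60_000, handler)}}. Because the watchdog never touches {{cancel()}}, it still interrupts and eventually escalates even when the canceler thread is blocked.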

> TaskManager should commit suicide after cancellation failure
> ------------------------------------------------------------
>
>                 Key: FLINK-4715
>                 URL: https://issues.apache.org/jira/browse/FLINK-4715
>             Project: Flink
>          Issue Type: Improvement
>          Components: TaskManager
>    Affects Versions: 1.2.0
>            Reporter: Till Rohrmann
>             Fix For: 1.2.0
>
>
> In case of a failed cancellation, e.g. the task cannot be cancelled after a 
> given time, the {{TaskManager}} should kill itself. That way we guarantee 
> that there is no resource leak. 
> This behaviour acts as a safety-net against faulty user code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
