[ https://issues.apache.org/jira/browse/FLINK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582048#comment-15582048 ]
Stephan Ewen commented on FLINK-4715:
-------------------------------------

I think we should do the following: split the cancellation up into two threads:

# The first thread calls {{cancel()}} on the task and {{interrupt()}} on the main thread. It then exits.
# The second thread is a watchdog that kicks in after {{n}} seconds (default is 10, I think) and then calls {{interrupt()}} every {{n}} seconds. After a maximum duration (let's say 1 minute), it notifies the {{TaskManager}} of a fatal error. In most setups, this leads to a process kill.

The reason to separate this into two threads is that we have seen cases where {{cancel()}} blocks waiting on a lock held by the main thread. If everything ran in a single thread, then in that case the {{interrupt()}} call would never be made, and the "task manager exit" safety net would never kick in either.

> TaskManager should commit suicide after cancellation failure
> ------------------------------------------------------------
>
>                 Key: FLINK-4715
>                 URL: https://issues.apache.org/jira/browse/FLINK-4715
>             Project: Flink
>          Issue Type: Improvement
>          Components: TaskManager
>    Affects Versions: 1.2.0
>            Reporter: Till Rohrmann
>             Fix For: 1.2.0
>
>
> In case of a failed cancellation, e.g. the task cannot be cancelled after a
> given time, the {{TaskManager}} should kill itself. That way we guarantee
> that there is no resource leak.
> This behaviour acts as a safety-net against faulty user code.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
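
For illustration, the two-thread scheme proposed in the comment above could be sketched roughly as follows. This is only a minimal sketch, not Flink's actual implementation: the class {{TwoThreadCancellation}}, the interfaces {{Cancelable}} and {{FatalErrorHandler}}, and the parameter names are hypothetical stand-ins for the task's {{cancel()}} hook and for the notification sent to the {{TaskManager}}.

```java
class TwoThreadCancellation {

    /** Hypothetical stand-in for the task's cancel() hook; it may block. */
    interface Cancelable {
        void cancel() throws Exception;
    }

    /** Hypothetical stand-in for reporting a fatal error to the TaskManager. */
    interface FatalErrorHandler {
        void onFatalError(String message);
    }

    static void cancelTask(
            Cancelable task,
            Thread mainThread,
            long interruptIntervalMs,
            long timeoutMs,
            FatalErrorHandler handler) {

        // Thread 1: calls cancel(), interrupts the main thread once, then exits.
        // If cancel() blocks on a lock held by the main thread, only this
        // thread is stuck; the watchdog below is unaffected.
        Thread canceler = new Thread(() -> {
            try {
                task.cancel();
            } catch (Throwable t) {
                // Faulty user code must not prevent the interrupt below.
            }
            mainThread.interrupt();
        }, "canceler");

        // Thread 2: watchdog. Waits one interval, then interrupts the main
        // thread once per interval until it terminates or the timeout expires.
        Thread watchdog = new Thread(() -> {
            long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
            try {
                while (mainThread.isAlive()) {
                    long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remainingMs <= 0) {
                        handler.onFatalError(
                                "Task did not terminate within " + timeoutMs + " ms");
                        return;
                    }
                    // join(0) would wait forever, so wait at least 1 ms.
                    long waitMs = Math.max(1L, Math.min(interruptIntervalMs, remainingMs));
                    mainThread.join(waitMs);
                    if (mainThread.isAlive()) {
                        mainThread.interrupt();
                    }
                }
            } catch (InterruptedException e) {
                // The watchdog itself was interrupted; give up quietly.
            }
        }, "cancellation-watchdog");

        canceler.setDaemon(true);
        watchdog.setDaemon(true);
        canceler.start();
        watchdog.start();
    }
}
```

With the values mentioned in the comment, a call would look something like {{cancelTask(task, mainThread, 10_000, 60_000, handler)}}, where the handler would be the place to trigger the process kill. Because the watchdog never touches the task's lock, it keeps interrupting and eventually escalates even when the first thread is blocked inside {{cancel()}}.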