[jira] [Commented] (FLINK-24182) Tasks canceler should not immediately interrupt

Piotr Nowojski (Jira) Mon, 13 Sep 2021 08:58:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-24182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414277#comment-17414277
 ]


Piotr Nowojski commented on FLINK-24182:
----------------------------------------

One more issue popped up during implementation. If a task is already 
failing, for example because of an exception thrown from a mailbox action (like 
{{snapshotState}}), we obviously have to cancel the source function 
(FLINK-21990). But should we also interrupt it? Should we register the 
JVM-killer watchdog as well?

After a longer discussion with [~dwysakowicz], we came to the conclusion that 
the answer to both is probably yes. We know we are already failing, so let's 
make sure that the task closes and releases its resources in a timely manner. 
If it does not, let's kill the JVM.
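
To make the "interrupt, and if that does not help, kill the JVM" idea concrete, 
here is a minimal sketch. It is not the actual Task code: the class 
{{FailingTaskShutdown}}, the method {{interruptUntilDead}} and its parameters 
are hypothetical names, and the "kill the JVM" step is passed in as a plain 
{{Runnable}} so the sketch can be run safely.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Flink's actual Task code.
final class FailingTaskShutdown {

    /**
     * Interrupts taskThread every intervalMs until it exits. If the thread is
     * still alive after timeoutMs, runs fatalAction (a stand-in for halting
     * the JVM) and returns false. Returns true if the thread exited in time.
     */
    static boolean interruptUntilDead(
            Thread taskThread, long intervalMs, long timeoutMs, Runnable fatalAction) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        try {
            while (taskThread.isAlive()) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    // Watchdog part: the task did not release its resources in time.
                    fatalAction.run();
                    return false;
                }
                // Interrupter part: wake the task up from blocking calls.
                taskThread.interrupt();
                // remainingMs > 0 here, so join() is never called with 0 (wait forever).
                taskThread.join(Math.min(intervalMs, remainingMs));
            }
            return true;
        } catch (InterruptedException e) {
            // The calling thread itself was interrupted; restore the flag and give up.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A task that reacts to the interrupt returns {{true}} without ever touching the 
fatal action; a task that swallows every interrupt ends up in the fatal action 
once the timeout is depleted.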

We also discussed whether we should register the same watchdogs when a task is 
doing a clean shutdown (the bounded input case). What if the task is deadlocked 
in some close method? But this is very similar to a deadlock/livelock inside the 
normal run/invoke method, and we cannot tell whether the task is stuck or just 
taking a very long time to close. So we concluded that it is better to wait 
indefinitely during a clean shutdown. In that case the user would have to cancel 
the job manually.
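
The outcome of the discussion can be summarised as a small decision table. 
The sketch below only restates the conclusions above. {{ShutdownPolicy}}, 
{{Reason}} and {{usesInterrupterAndWatchdog}} are made-up names, and the 
{{USER_CANCEL}} row assumes that regular cancellation already registers both 
watchdogs.

```java
// Hypothetical summary of the conclusions above, not actual Flink code.
final class ShutdownPolicy {

    /** Why the task is shutting down. */
    enum Reason {
        USER_CANCEL,  // job cancelled by the user (assumed to use the watchdogs already)
        FAILURE,      // task is already failing, e.g. exception from a mailbox action
        CLEAN_FINISH  // bounded input ended, task closes normally
    }

    /** Returns true if the interrupter and the JVM-killer watchdog should be registered. */
    static boolean usesInterrupterAndWatchdog(Reason reason) {
        switch (reason) {
            case USER_CANCEL:
            case FAILURE:
                // Resources should be released in a timely manner; otherwise kill the JVM.
                return true;
            case CLEAN_FINISH:
                // A slow close cannot be told apart from a deadlock, so wait indefinitely.
                return false;
            default:
                throw new IllegalArgumentException("Unknown reason: " + reason);
        }
    }
}
```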

> Tasks canceler should not immediately interrupt
> -----------------------------------------------
>
>                 Key: FLINK-24182
>                 URL: https://issues.apache.org/jira/browse/FLINK-24182
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>            Reporter: Arvid Heise
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> While debugging resource leaks (FLINK-24131), I found that any connector is 
> immediately interrupted on cancel. Hence, any attempts of using blocking 
> calls in {{close}} to cleanup resources are immediately unreliable (e.g. 
> aborting transactions).
> It would be nice if tasks get a grace period (e.g. 
> task.cancellation.interval) where they can try to free resources in a proper, 
> potentially blocking fashion before being interrupted.
> Nevertheless, connectors should always expect interruptions during shutdown, 
> in particular when the user-configurable grace period is depleted. I'd add 
> that to the connector documentation in a separate effort.



