[ 
https://issues.apache.org/jira/browse/FLINK-38180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-38180:
-----------------------------------
    Labels: pull-request-available  (was: )

> Race condition between failing Task and cancelation hiding the real exception
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-38180
>                 URL: https://issues.apache.org/jira/browse/FLINK-38180
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.14.0, 2.0.0, 1.15.4, 1.16.3, 1.17.2, 1.18.1, 1.19.3, 
> 1.20.2
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Major
>              Labels: pull-request-available
>
> When task fails, during {{Task#restoreAndInvoke}} invocation we will call 
> {{finalInvokable.cleanUp(throwable)}}, which ultimately can cancel some state 
> backend operations. This cancelation can then report 
> {{Task#failExternally(CancelTaskException)}} before the task thread manages 
> to switch to {{FAILED}} state and properly set the {{failureCause}}. When 
> that happens, task switches from {{RUNNING}} to {{FAILED}}:
> {noformat}
> switched from RUNNING to FAILED due to CancelTaskException.
> {noformat}
> from that async cancellation. 
> Later, when task thread processes the real failure, Task is already in 
> {{FAILED}} state, so the real failure is being ignored.
> Example stack trace of Task thread with the race condition:
> {noformat}
> java.lang.Exception: StackTrace
>       at 
> org.apache.flink.contrib.streaming.state.RocksDBIncrementalCheckpointUtils.lambda$createAsyncRangeCompactionTask$0(RocksDBIncrementalCheckpointUtils.java:288)
>       at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:295)
>       at org.apache.flink.util.IOUtils.closeAllQuietly(IOUtils.java:282)
>       at 
> org.apache.flink.core.fs.CloseableRegistry.doClose(CloseableRegistry.java:65)
>       at 
> org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
>       at 
> org.apache.flink.runtime.state.AbstractKeyedStateBackend.close(AbstractKeyedStateBackend.java:424)
>       at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:295)
>       at org.apache.flink.util.IOUtils.closeAllQuietly(IOUtils.java:282)
>       at 
> org.apache.flink.core.fs.CloseableRegistry.doClose(CloseableRegistry.java:65)
>       at 
> org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
>       at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:255)
>       at 
> org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72)
>       at 
> org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:1061)
>       at 
> org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$1(Task.java:1039)
>       at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:1057)
>       at 
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:1039) // 
> <<<<<<< (1) HERE we start to cleanUp and cancel state backend operations (as 
> shown above)
>       at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:840) // 
> <<<<<<< (2) if cleanUp/cancellation calls `failExternally` before we process 
> the failure in `doRun`, the exception will be hidden
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:654)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to