[
https://issues.apache.org/jira/browse/FLINK-38180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Piotr Nowojski closed FLINK-38180.
----------------------------------
Fix Version/s: 2.1.0
Resolution: Fixed
merged to master as dc2627b565d
> Race condition between failing Task and cancelation hiding the real exception
> -----------------------------------------------------------------------------
>
> Key: FLINK-38180
> URL: https://issues.apache.org/jira/browse/FLINK-38180
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.14.0, 2.0.0, 1.15.4, 1.16.3, 1.17.2, 1.18.1, 1.19.3,
> 1.20.2
> Reporter: Piotr Nowojski
> Assignee: Piotr Nowojski
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.1.0
>
>
> When task fails, during {{Task#restoreAndInvoke}} invocation we will call
> {{finalInvokable.cleanUp(throwable)}}, which ultimately can cancel some state
> backend operations. This cancelation can then report
> {{Task#failExternally(CancelTaskException)}} before the task thread manages
> to switch to {{FAILED}} state and properly set the {{failureCause}}. When
> that happens, task switches from {{RUNNING}} to {{FAILED}}:
> {noformat}
> switched from RUNNING to FAILED due to CancelTaskException.
> {noformat}
> from that async cancellation.
> Later, when task thread processes the real failure, Task is already in
> {{FAILED}} state, so the real failure is being ignored.
> Example stack trace of Task thread with the race condition:
> {noformat}
> java.lang.Exception: StackTrace
> at
> org.apache.flink.contrib.streaming.state.RocksDBIncrementalCheckpointUtils.lambda$createAsyncRangeCompactionTask$0(RocksDBIncrementalCheckpointUtils.java:288)
> at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:295)
> at org.apache.flink.util.IOUtils.closeAllQuietly(IOUtils.java:282)
> at
> org.apache.flink.core.fs.CloseableRegistry.doClose(CloseableRegistry.java:65)
> at
> org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
> at
> org.apache.flink.runtime.state.AbstractKeyedStateBackend.close(AbstractKeyedStateBackend.java:424)
> at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:295)
> at org.apache.flink.util.IOUtils.closeAllQuietly(IOUtils.java:282)
> at
> org.apache.flink.core.fs.CloseableRegistry.doClose(CloseableRegistry.java:65)
> at
> org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
> at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:255)
> at
> org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72)
> at
> org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:1061)
> at
> org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$1(Task.java:1039)
> at
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:1057)
> at
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:1039) //
> <<<<<<< (1) HERE we start to cleanUp and cancel state backend operations (as
> shown above)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:840) //
> <<<<<<< (2) if cleanUp/cancellation calls `failExternally` before we process
> the failure in `doRun`, the exception will be hidden
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:654)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)