[ https://issues.apache.org/jira/browse/FLINK-38180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-38180:
-----------------------------------
    Labels: pull-request-available  (was: )

> Race condition between failing Task and cancelation hiding the real exception
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-38180
>                 URL: https://issues.apache.org/jira/browse/FLINK-38180
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.14.0, 2.0.0, 1.15.4, 1.16.3, 1.17.2, 1.18.1, 1.19.3, 1.20.2
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Major
>              Labels: pull-request-available
>
> When a task fails, during the {{Task#restoreAndInvoke}} invocation we call
> {{finalInvokable.cleanUp(throwable)}}, which can ultimately cancel some state
> backend operations. This cancellation can then report
> {{Task#failExternally(CancelTaskException)}} before the task thread manages
> to switch to the {{FAILED}} state and properly set the {{failureCause}}. When
> that happens, the task switches from {{RUNNING}} to {{FAILED}} because of
> that async cancellation:
> {noformat}
> switched from RUNNING to FAILED due to CancelTaskException.
> {noformat}
> Later, when the task thread processes the real failure, the Task is already
> in the {{FAILED}} state, so the real failure is ignored.
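
The "first failure wins" behaviour described above can be illustrated with a minimal, single-threaded sketch. This is not Flink's actual `Task` implementation: `ToyTask`, `fail`, and `simulateRace` are hypothetical names, and `IllegalStateException` merely stands in for `CancelTaskException`. The sketch assumes only that the state transition is a compare-and-set from RUNNING to FAILED, so whichever caller reports first determines the recorded cause.

```java
import java.util.concurrent.atomic.AtomicReference;

public class FirstFailureWinsSketch {

    enum State { RUNNING, FAILED }

    /** Hypothetical stand-in for a task; not Flink's Task class. */
    static final class ToyTask {
        private final AtomicReference<State> state =
                new AtomicReference<>(State.RUNNING);
        private volatile Throwable failureCause;

        /** Returns true only if this call moved the task from RUNNING to FAILED. */
        boolean fail(Throwable cause) {
            if (state.compareAndSet(State.RUNNING, State.FAILED)) {
                failureCause = cause;
                return true;
            }
            // Already FAILED: the cause passed to this call is silently dropped.
            return false;
        }

        State getState() {
            return state.get();
        }

        Throwable getFailureCause() {
            return failureCause;
        }
    }

    /** Replays the problematic ordering deterministically on one thread. */
    static ToyTask simulateRace() {
        ToyTask task = new ToyTask();
        Exception realFailure = new RuntimeException("real failure");

        // (1) Cleanup triggered by the real failure cancels async work,
        //     and that cancellation reports its own exception first.
        task.fail(new IllegalStateException("CancelTaskException stand-in"));

        // (2) The task thread reports the real failure too late.
        task.fail(realFailure);
        return task;
    }

    public static void main(String[] args) {
        ToyTask task = simulateRace();
        System.out.println(
                task.getState() + " due to " + task.getFailureCause().getMessage());
    }
}
```

Running `main` prints `FAILED due to CancelTaskException stand-in`: the real failure never becomes the recorded cause, which mirrors the log line quoted in the issue.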
> Example stack trace of Task thread with the race condition:
> {noformat}
> java.lang.Exception: StackTrace
>     at org.apache.flink.contrib.streaming.state.RocksDBIncrementalCheckpointUtils.lambda$createAsyncRangeCompactionTask$0(RocksDBIncrementalCheckpointUtils.java:288)
>     at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:295)
>     at org.apache.flink.util.IOUtils.closeAllQuietly(IOUtils.java:282)
>     at org.apache.flink.core.fs.CloseableRegistry.doClose(CloseableRegistry.java:65)
>     at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
>     at org.apache.flink.runtime.state.AbstractKeyedStateBackend.close(AbstractKeyedStateBackend.java:424)
>     at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:295)
>     at org.apache.flink.util.IOUtils.closeAllQuietly(IOUtils.java:282)
>     at org.apache.flink.core.fs.CloseableRegistry.doClose(CloseableRegistry.java:65)
>     at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
>     at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:255)
>     at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72)
>     at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:1061)
>     at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$1(Task.java:1039)
>     at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:1057)
>     at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:1039) // <<<<<<< (1) HERE we start to cleanUp and cancel state backend operations (as shown above)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:840) // <<<<<<< (2) if cleanUp/cancellation calls `failExternally` before we process the failure in `doRun`, the exception will be hidden
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:654)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)