[ 
https://issues.apache.org/jira/browse/FLINK-39998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092906#comment-18092906
 ] 

Raj commented on FLINK-39998:
-----------------------------

Awesome. Thanks. 

>  RocksDBStateDownloader loses parallel download failures, surfacing a 
> misleading ClosedChannelException instead of the real root cause
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39998
>                 URL: https://issues.apache.org/jira/browse/FLINK-39998
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 2.0.0, 1.20.5, 2.1.3
>            Reporter: Raj
>            Assignee: Raj
>            Priority: Major
>              Labels: pull-request-available, starter
>             Fix For: 2.4.0
>
>         Attachments: Screenshot 2026-06-26 at 3.23.42 PM.png
>
>
> When RocksDBStateDownloader.transferAllStateDataToDirectory fails during 
> incremental state restore, the exception surfaced to the operator is always 
> ClosedChannelException — regardless of the actual root cause. This makes 
> diagnosing restore failures extremely difficult.
> *Example scenario:* A checkpoint file is accidentally deleted from S3. On job 
> restart, Flink keeps crashing with:
>   Caused by: java.io.IOException: java.nio.channels.ClosedChannelException
>       at 
> org.apache.flink.state.rocksdb.RocksDBStateDownloader.downloadDataForStateHandle
> There is no indication of which file is missing, which checkpoint is 
> affected, or that S3 is involved at all.
>   
> *Root cause:*
>   FutureUtils.completeAll() collects all parallel thread failures as 
> suppressed exceptions on the first exception to arrive (CompletionException). 
> However, CompletableFuture.get()
>   internally calls JDK's reportGet() which strips the CompletionException 
> wrapper before throwing ExecutionException:
>   // JDK CompletableFuture.reportGet()
>   if (x instanceof CompletionException && x.getCause() != null)
>       x = cause;  // strips CompletionException — suppressed list GONE
>   throw new ExecutionException(x);
>   By the time the catch block in transferAllStateDataToDirectory runs, all 
> thread failures except one are permanently lost. Which failure "wins" is 
> non-deterministic — it is whichever
> thread completes first, which is typically the cascade ClosedChannelException 
> (a local operation, fast) rather than the real cause (e.g. 
> FileNotFoundException from a remote storage call).
>   
>      



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to