[
https://issues.apache.org/jira/browse/SPARK-52507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021455#comment-18021455
]
Aparna Garg commented on SPARK-52507:
-------------------------------------
User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/51202
> Quick fallback to fallback storage on fetch failure
> ---------------------------------------------------
>
> Key: SPARK-52507
> URL: https://issues.apache.org/jira/browse/SPARK-52507
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: 4.1.0
> Reporter: Enrico Minack
> Priority: Major
>
> Using the fallback storage with storage decommissioning on Kubernetes can run
> into a situation where some tasks try to read from an executor that has just
> been decommissioned. The driver holds updated location information for the
> migrated shuffle data, but the task still uses the outdated location.
> Given that the fallback storage is enabled and shuffle data is only ever
> migrated to the fallback storage (SPARK-52506), it is very likely that a
> fetch failure can be recovered from the fallback storage directly. The task
> then does not need to go through a fetch failure and a task or stage restart
> just to obtain the updated shuffle data location.
> This benefits from:
> 1. connections to decommissioned executors failing quickly (connection
> refused rather than connection timeout), see SPARK-52505
> 2. storage migration migrating only to the fallback storage, see SPARK-52506
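For context, this is the decommissioning configuration such a setup would rely on. The setting names below are the existing Spark decommissioning/fallback-storage options; the storage path is a placeholder, and the comments reflect the behavior proposed here, not current defaults:

```
# Enable executor decommissioning and shuffle block migration
spark.decommission.enabled                        true
spark.storage.decommission.enabled                true
spark.storage.decommission.shuffleBlocks.enabled  true
# Fallback storage location (placeholder path); with SPARK-52506 shuffle
# data would be migrated to this location only, so a fetch failure could
# fall back here immediately instead of triggering a stage retry
spark.storage.decommission.fallbackStorage.path   s3a://my-bucket/spark-fallback/
```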
--
This message was sent by Atlassian Jira
(v8.20.10#820010)