[
https://issues.apache.org/jira/browse/SPARK-52507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021455#comment-18021455
]
Aparna Garg commented on SPARK-52507:
-------------------------------------
User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/51202
> Quick fallback to fallback storage on fetch failure
> ---------------------------------------------------
>
> Key: SPARK-52507
> URL: https://issues.apache.org/jira/browse/SPARK-52507
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: 4.1.0
> Reporter: Enrico Minack
> Priority: Major
>
> Using the fallback storage with storage decommissioning on Kubernetes can run
> into a situation where some tasks try to read from an executor that has just
> been decommissioned. The driver holds updated location information for the
> migrated shuffle data, but the task still uses the outdated location.
> Given that the fallback storage is enabled and shuffle data is only ever
> migrated to the fallback storage (SPARK-52506), it is very likely that a
> fetch failure can be recovered from the fallback storage directly. The task
> then does not need to go through a fetch failure and a task or stage restart
> just to obtain the updated shuffle data location.
> This benefits from:
> 1. connections to decommissioned executors failing quickly (connection
> refused rather than connection timeout), see SPARK-52505
> 2. storage migration migrating only to the fallback storage, see SPARK-52506
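For context, this is the decommissioning configuration such a setup would rely on. The setting names below are the existing Spark decommissioning/fallback-storage options; the storage path is a placeholder, and the comments reflect the behavior proposed here, not current defaults:

```
# Enable executor decommissioning and shuffle block migration
spark.decommission.enabled                        true
spark.storage.decommission.enabled                true
spark.storage.decommission.shuffleBlocks.enabled  true
# Fallback storage location (placeholder path); with SPARK-52506 shuffle
# data would be migrated to this location only, so a fetch failure could
# fall back here immediately instead of triggering a stage retry
spark.storage.decommission.fallbackStorage.path   s3a://my-bucket/spark-fallback/
```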
--
This message was sent by Atlassian Jira
(v8.20.10#820010)