EnricoMi opened a new pull request, #51202: URL: https://github.com/apache/spark/pull/51202
### What changes were proposed in this pull request?

In the presence of a fallback storage, `ShuffleBlockFetcherIterator` can, on a fetch failure, optimistically try to read the block from the fallback storage, as the block might have been migrated there from a decommissioned executor. If shuffle migration happens **only** to the fallback storage (#51201), this optimistic assumption is even more likely to hold.

Note: this optimistic attempt to find the missing shuffle data on the fallback storage would collide with the replication delay handled in #51200.

### Why are the changes needed?

In a Kubernetes environment, executors may be decommissioned. With a fallback storage configured, shuffle data is migrated to other executors or to the fallback storage. A task that starts during the decommissioning of another executor might try to read blocks from that executor after it has been decommissioned, and the task does not know the new location of the migrated block. Given a fallback storage is configured, it can optimistically try to read the block from there. This avoids a stage retry, which is otherwise an expensive way to fetch the current block address after a block migration.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test and manual testing in a [Kubernetes setup](https://gist.github.com/EnricoMi/e9daa1176bce4c1211af3f3c5848112a).

### Was this patch authored or co-authored using generative AI tooling?

No
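The fetch-failure handling described above can be sketched as follows. This is a minimal, hypothetical Python illustration of the retry logic, not Spark's actual Scala implementation; the function and parameter names (`fetch_block`, `primary_fetch`, `fallback_fetch`) are invented for this sketch.

```python
# Hypothetical sketch of optimistic fallback-storage reads on fetch failure.
# Not Spark's real API: names and signatures here are illustrative only.

def fetch_block(block_id, primary_fetch, fallback_fetch):
    """Try the block's known executor location first; on a fetch failure,
    optimistically try the fallback storage, where the block may have been
    migrated after its executor was decommissioned."""
    try:
        return primary_fetch(block_id)
    except IOError:
        # The executor may have been decommissioned and the block migrated
        # to the fallback storage; try there before failing the stage.
        return fallback_fetch(block_id)


# Usage: the primary location fails (executor gone), the fallback succeeds,
# so the task completes without triggering an expensive stage retry.
def primary(block_id):
    raise IOError("connection refused: executor decommissioned")

def fallback(block_id):
    return b"migrated-shuffle-data"

data = fetch_block("shuffle_0_1_2", primary, fallback)
```

If the block is also missing from the fallback storage, the `fallback_fetch` call raises and the failure propagates as before, so the existing stage-retry path remains the last resort.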
