EnricoMi opened a new pull request, #51200: URL: https://github.com/apache/spark/pull/51200
### What changes were proposed in this pull request?
Adds options to retry `FileNotFoundException`s when opening files that were migrated to the fallback storage.
- `STORAGE_DECOMMISSION_FALLBACK_STORAGE_REPLICATION_DELAY` sets the allowed replication delay. The executor waits at most this long for the shuffle data file to appear on the fallback storage.
- `STORAGE_DECOMMISSION_FALLBACK_STORAGE_REPLICATION_WAIT` sets the interval between re-attempts to find the file.

### Why are the changes needed?
When a distributed filesystem is used as the fallback storage for migrating shuffle data on executor decommissioning, executors that attempt to read the migrated data might not yet see the file written by the decommissioned executor. This is called replication delay.

Currently, executors give up instantly, even though they know the data has been successfully migrated to the fallback storage, from where it is not migrated further. Having the executor wait for a defined time and reattempt to open the file avoids a fetch failure and a re-computation of the migrated shuffle data.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

### Was this patch authored or co-authored using generative AI tooling?
No
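
For readers who want a concrete picture of the wait-and-retry behaviour described in this pull request, here is a minimal, language-neutral sketch in Python. It is not the code from the PR. The function name `open_with_replication_wait` and its parameters are hypothetical: `replication_delay_s` plays the role of the replication delay option, `replication_wait_s` the role of the re-attempt interval, and Python's `FileNotFoundError` stands in for the JVM's `FileNotFoundException`. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def open_with_replication_wait(
    open_file: Callable[[], T],
    replication_delay_s: float,
    replication_wait_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call open_file(), retrying on FileNotFoundError until a deadline.

    Hypothetical illustration only; all names here are assumptions.
    replication_delay_s: maximum total time to wait for the file to appear.
    replication_wait_s:  pause between two consecutive attempts.
    """
    deadline = clock() + replication_delay_s
    while True:
        try:
            return open_file()
        except FileNotFoundError:
            remaining = deadline - clock()
            if remaining <= 0:
                # Allowed replication delay is used up: surface the error,
                # which the caller would treat as a fetch failure.
                raise
            # Never sleep past the deadline.
            sleep(min(replication_wait_s, remaining))
```

With `replication_delay_s` set to `0`, the sketch raises after the first failed attempt, which mirrors the "give up instantly" behaviour the description calls the current state. A caller might use it as `open_with_replication_wait(lambda: open(path, "rb"), 5.0, 1.0)`, where `path` is whatever file location the reader has resolved.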
