EnricoMi opened a new pull request, #51200:
URL: https://github.com/apache/spark/pull/51200

   ### What changes were proposed in this pull request?
   Adds options to retry `FileNotFoundException`s when opening files migrated 
to the fallback storage.
   
   - `STORAGE_DECOMMISSION_FALLBACK_STORAGE_REPLICATION_DELAY` sets the maximum 
allowed replication delay. The executor waits at most this long for the shuffle 
data file to appear on the fallback storage.
   - `STORAGE_DECOMMISSION_FALLBACK_STORAGE_REPLICATION_WAIT` sets the interval 
between attempts to open the file.
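
How the two options could interact can be sketched roughly as follows. This is an illustrative Python sketch, not the code in this pull request: the function `open_with_retry` and its parameters `replication_delay_s` and `replication_wait_s` are hypothetical names, and capping the last sleep at the remaining budget is an assumption.

```python
import time
from typing import BinaryIO, Callable


def open_with_retry(
    open_file: Callable[[], BinaryIO],
    replication_delay_s: float,
    replication_wait_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> BinaryIO:
    """Call open_file() until it succeeds or the delay budget is used up.

    replication_delay_s: total time to keep retrying (hypothetical name).
    replication_wait_s:  pause between two attempts (hypothetical name).
    """
    deadline = clock() + replication_delay_s
    while True:
        try:
            return open_file()
        except FileNotFoundError:
            remaining = deadline - clock()
            if remaining <= 0:
                # Budget exhausted: surface the original error to the caller.
                raise
            # Assumption: never sleep past the deadline.
            sleep(min(replication_wait_s, remaining))
```

With `replication_delay_s` set to 0 the sketch fails on the first `FileNotFoundError`, which corresponds to giving up instantly.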
   
   ### Why are the changes needed?
   When a distributed filesystem is used as the fallback storage for shuffle 
data migrated on executor decommissioning, executors that attempt to read the 
migrated data might not yet see the file written by the decommissioned 
executor. This is called replication delay.
   
   Currently, executors give up immediately, even though they know the data has 
been successfully migrated to the fallback storage and will not be migrated any 
further. Having the executor wait for a bounded time and retry opening the file 
avoids a fetch failure and the re-computation of the migrated shuffle data.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

