liupc opened a new pull request #23647: [SPARK-26712]Support multi directories for executor shuffle info recovery in yarn shuffle service
URL: https://github.com/apache/spark/pull/23647

## What changes were proposed in this pull request?

Currently, `ExecutorShuffleInfo` can be recovered from a file if NM (NodeManager) recovery is enabled. However, the recovery file lives under a single directory, which may become unavailable if its disk fails. If the NM then restarts (for example, after being killed), the shuffle service cannot start and the `ExecutorShuffleInfo` is lost, even if executors are still running on the node. This may eventually cause job failures (if the node or the executors on it are not blacklisted), or at the very least waste resources, because shuffle fetches from this node always fail. For long-running Spark applications, the problem can be more serious.

This PR introduces a mechanism to support multiple directories for executor shuffle info recovery, which improves the robustness of the `YarnShuffleService`.

## How was this patch tested?

Unit tests (UT).
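To illustrate the multi-directory recovery idea described above, here is a minimal sketch in Python. It is not the code from this PR (the actual `YarnShuffleService` is written in Java), and the function names, the file name `executor_shuffle_info.db`, and the selection policy are all assumptions made for illustration: prefer a usable directory that already holds the recovery file, otherwise fall back to the first usable directory.

```python
import os
import tempfile
from typing import Iterable, Optional


def is_usable_dir(path: str) -> bool:
    """Return True if `path` exists (or can be created) and is writable."""
    try:
        os.makedirs(path, exist_ok=True)
    except OSError:
        # Covers a broken or missing mount, a read-only parent, and similar failures.
        return False
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


def choose_recovery_file(candidate_dirs: Iterable[str], db_name: str) -> Optional[str]:
    """Pick a recovery file path from several candidate directories.

    Hypothetical policy: a usable directory that already contains the
    recovery file wins, so existing state is reused after a restart.
    Otherwise the first usable directory is chosen. Returns None when
    no candidate directory is usable.
    """
    first_usable = None
    for directory in candidate_dirs:
        if not is_usable_dir(directory):
            continue  # skip directories on broken disks
        candidate = os.path.join(directory, db_name)
        if os.path.exists(candidate):
            return candidate
        if first_usable is None:
            first_usable = candidate
    return first_usable


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Simulate a broken disk: a path nested under a regular file
        # can never be created as a directory.
        blocker = os.path.join(root, "blocker")
        open(blocker, "w").close()
        broken = os.path.join(blocker, "nm-recovery")
        healthy = os.path.join(root, "disk1", "nm-recovery")
        print(choose_recovery_file([broken, healthy], "executor_shuffle_info.db"))
```

With a single recovery directory, a failure of that one disk leaves no fallback; with a list of candidates, the service can still start as long as at least one directory is usable.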