liupc opened a new pull request #23647: [SPARK-26712]Support multi directories 
for executor shuffle info recovery in yarn shuffle service
URL: https://github.com/apache/spark/pull/23647
 
 
   ## What changes were proposed in this pull request?
   
   Currently, `ExecutorShuffleInfo` can be recovered from a file if NM 
(NodeManager) recovery is enabled. However, the recovery file lives under a 
single directory, which may become unavailable if its disk fails. If the NM 
then restarts (for example, after being killed), the shuffle service cannot 
start and the `ExecutorShuffleInfo` is lost, even if executors are still 
running on the node.
   
   This may eventually cause job failures (if the node or the executors on it 
are not blacklisted), or at the very least waste resources, because shuffle 
fetches from this node always fail. For long-running Spark applications, the 
problem may be more serious.
   
   This PR introduces a mechanism to support multiple directories for executor 
shuffle info recovery, which improves the robustness of the 
`YarnShuffleService`.
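   
   As a rough illustration of the idea only (this is not the code in this PR), 
the selection among several candidate directories could look like the sketch 
below. The function names, the probe-by-writing check, and the 
"prefer a directory that already holds the file" rule are all assumptions made 
for the example.

```python
import os
import tempfile
from typing import Iterable, Optional


def is_usable(directory: str) -> bool:
    """Return True if the directory exists and a file can be created in it.

    Hypothetical helper: probes with a real write, since permission bits
    alone may not reveal a failed disk.
    """
    if not os.path.isdir(directory):
        return False
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        return False


def pick_recovery_path(candidate_dirs: Iterable[str], file_name: str) -> Optional[str]:
    """Choose where the recovery file should live (illustrative only).

    Prefer a usable directory that already holds the file, so existing
    state can be recovered; otherwise fall back to the first usable
    directory. Return None when no candidate is usable.
    """
    usable = [d for d in candidate_dirs if is_usable(d)]
    for d in usable:
        path = os.path.join(d, file_name)
        if os.path.exists(path):
            return path
    return os.path.join(usable[0], file_name) if usable else None
```

   With a single configured directory this degenerates to the current 
behavior; with several, a broken disk only removes one candidate instead of 
preventing recovery altogether.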
   
   ## How was this patch tested?
   
   Unit tests.
   
   Please review http://spark.apache.org/contributing.html before opening a 
pull request.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
