Eric Liang created SPARK-17370:
----------------------------------

             Summary: Shuffle service files not invalidated when a slave is lost
                 Key: SPARK-17370
                 URL: https://issues.apache.org/jira/browse/SPARK-17370
             Project: Spark
          Issue Type: Bug
            Reporter: Eric Liang


DAGScheduler invalidates an executor's shuffle files when it receives an 
executor loss event, except when the external shuffle service is enabled. 
With the shuffle service on, shuffle files can outlive the executor that 
wrote them, so they are kept.

However, DAGScheduler also fails to invalidate shuffle files when the shuffle 
service itself is lost (because the whole slave is lost). This can cause long 
hangs after a slave loss, since the missing files are not detected until a 
subsequent stage attempts to read them.
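
The intended behaviour can be sketched with a toy model. This is not Spark's 
code: ToyMapOutputTracker, handle_executor_lost and the host_lost flag are 
hypothetical names made up for illustration, and the sketch assumes the 
scheduler can tell a single executor loss apart from a whole slave loss.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMapOutputTracker:
    """Toy registry of shuffle outputs (hypothetical, not Spark's real class)."""

    # Maps (host, executor_id) -> set of map output ids stored there.
    outputs: dict = field(default_factory=dict)

    def register(self, host, executor_id, output_id):
        self.outputs.setdefault((host, executor_id), set()).add(output_id)

    def remove_outputs_on_executor(self, executor_id):
        for key in [k for k in self.outputs if k[1] == executor_id]:
            del self.outputs[key]

    def remove_outputs_on_host(self, host):
        for key in [k for k in self.outputs if k[0] == host]:
            del self.outputs[key]

    def available(self):
        # Union of all output ids that are still considered readable.
        return set().union(*self.outputs.values())


def handle_executor_lost(tracker, host, executor_id,
                         shuffle_service_enabled, host_lost):
    """Decide which shuffle outputs to invalidate after a loss event."""
    if host_lost:
        # The whole slave is gone, so its shuffle service is gone too:
        # every output on that host is unreadable, whichever executor wrote it.
        tracker.remove_outputs_on_host(host)
    elif not shuffle_service_enabled:
        # No shuffle service: the files die with the executor.
        tracker.remove_outputs_on_executor(executor_id)
    # Otherwise the shuffle service on the host still serves the files,
    # so nothing is invalidated.
```

In this model the bug corresponds to the host_lost branch being absent: with 
the shuffle service enabled, outputs on a lost slave would stay registered 
until a later stage fails to fetch them.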



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

