[ https://issues.apache.org/jira/browse/SPARK-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Liang updated SPARK-17370:
-------------------------------
    Component/s: Spark Core

> Shuffle service files not invalidated when a slave is lost
> ----------------------------------------------------------
>
>                 Key: SPARK-17370
>                 URL: https://issues.apache.org/jira/browse/SPARK-17370
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Eric Liang
>
> DAGScheduler invalidates shuffle files when an executor loss event occurs,
> but not when the external shuffle service is enabled. This is because when
> shuffle service is on, the shuffle file lifetime can exceed the executor
> lifetime.
> However, it doesn't invalidate shuffle files when the shuffle service itself
> is lost (due to whole slave loss). This can cause long hangs when slaves are
> lost since the file loss is not detected until a subsequent stage attempts to
> read the shuffle files.
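
The distinction the report draws between losing an executor and losing the whole slave can be illustrated with a toy model. This is a minimal Python sketch, not Spark's scheduler code: `MapOutput`, `ToyMapOutputRegistry`, `handle_executor_lost`, and `handle_host_lost` are hypothetical names made up for illustration, and the host-level invalidation shown is one plausible reading of the behavior the report asks for, not a confirmed fix.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MapOutput:
    """One registered shuffle map output (hypothetical record type)."""
    shuffle_id: int
    map_id: int
    host: str
    executor_id: str


@dataclass
class ToyMapOutputRegistry:
    """Toy stand-in for the scheduler's view of which shuffle files exist."""
    external_shuffle_service: bool
    outputs: set = field(default_factory=set)

    def register(self, output):
        self.outputs.add(output)

    def handle_executor_lost(self, executor_id):
        # Without the external shuffle service, the files die with the executor,
        # so they are invalidated right away.
        if not self.external_shuffle_service:
            self.outputs = {o for o in self.outputs if o.executor_id != executor_id}
        # With the service on, the host can still serve the files, so keep them.

    def handle_host_lost(self, host):
        # Assumed behavior: when the whole host is gone, its shuffle service is
        # gone too, so every output on that host is invalidated regardless of
        # whether the external shuffle service is enabled.
        self.outputs = {o for o in self.outputs if o.host != host}

    def available(self, shuffle_id):
        # Map ids whose output is still believed to be readable.
        return sorted(o.map_id for o in self.outputs if o.shuffle_id == shuffle_id)
```

In this model, with the shuffle service enabled, calling only `handle_executor_lost` leaves the outputs registered. If the host is actually gone and `handle_host_lost` is never called, the stale entries persist until a reader fails, which corresponds to the late detection the report describes.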