[ https://issues.apache.org/jira/browse/SPARK-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-17370:
-------------------------------
Assignee: Eric Liang
> Shuffle service files not invalidated when a slave is lost
> ----------------------------------------------------------
>
> Key: SPARK-17370
> URL: https://issues.apache.org/jira/browse/SPARK-17370
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Eric Liang
> Assignee: Eric Liang
>
> When an executor is lost, DAGScheduler invalidates that executor's shuffle
> files, unless the external shuffle service is enabled. With the shuffle
> service on, shuffle files can outlive the executor that wrote them, so they
> are left registered.
> However, DAGScheduler also fails to invalidate shuffle files when the shuffle
> service itself is lost (because the whole slave is lost). This can cause long
> hangs after a slave loss, since the missing files are not detected until a
> subsequent stage attempts to read them.
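
The invalidation rule described in the report can be modelled roughly as follows. This is only a toy sketch in Python, not Spark's DAGScheduler code: the class `ShuffleOutputTracker`, its methods, and the `host_lost` flag are hypothetical names invented for illustration, and the whole-host branch shows the behaviour the report asks for, not a confirmed fix.

```python
from dataclasses import dataclass, field


@dataclass
class ShuffleOutputTracker:
    """Toy model of shuffle-output bookkeeping (hypothetical, not Spark's API)."""

    external_shuffle_service: bool
    # Maps (host, executor_id) -> set of shuffle ids with registered outputs.
    outputs: dict = field(default_factory=dict)

    def register(self, host, executor_id, shuffle_id):
        self.outputs.setdefault((host, executor_id), set()).add(shuffle_id)

    def on_executor_lost(self, host, executor_id, host_lost=False):
        """Invalidate outputs and return the (host, executor_id) keys dropped."""
        if host_lost:
            # The shuffle service on this host is gone as well, so every
            # output served from the host is unreachable (assumed behaviour).
            doomed = [key for key in self.outputs if key[0] == host]
        elif not self.external_shuffle_service:
            # Without the external service, files die with their executor.
            key = (host, executor_id)
            doomed = [key] if key in self.outputs else []
        else:
            # The external service keeps serving files after the executor exits.
            doomed = []
        for key in doomed:
            del self.outputs[key]
        return sorted(doomed)
```

With the shuffle service enabled, losing a single executor leaves its outputs registered, while losing the host drops every output recorded for that host, so the scheduler in this model would learn about the missing files immediately instead of at the next fetch.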
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)