[jira] [Updated] (SPARK-17370) Shuffle service files not invalidated when a slave is lost
[ https://issues.apache.org/jira/browse/SPARK-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-17370:
-------------------------------
    Fix Version/s: 2.0.1

> Shuffle service files not invalidated when a slave is lost
> ----------------------------------------------------------
>
>                 Key: SPARK-17370
>                 URL: https://issues.apache.org/jira/browse/SPARK-17370
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Eric Liang
>            Assignee: Eric Liang
>             Fix For: 2.0.1, 2.1.0
>
>
> DAGScheduler invalidates shuffle files when an executor loss event occurs,
> but not when the external shuffle service is enabled. This is because when
> shuffle service is on, the shuffle file lifetime can exceed the executor
> lifetime.
> However, it doesn't invalidate shuffle files when the shuffle service itself
> is lost (due to whole slave loss). This can cause long hangs when slaves are
> lost since the file loss is not detected until a subsequent stage attempts to
> read the shuffle files.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
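The invalidation rule described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's actual scheduler code: `MapOutputRegistry`, `handle_executor_lost`, and the `host_lost` flag are hypothetical names introduced here for illustration. It shows the three cases the report distinguishes: executor lost without the shuffle service, executor lost with the service still running, and the whole slave (and with it the service) lost.

```python
from dataclasses import dataclass, field


@dataclass
class MapOutputRegistry:
    """Toy stand-in for the scheduler's record of where shuffle files live.

    Hypothetical class for illustration only; not a Spark API.
    """
    # (shuffle_id, map_id) -> (host, executor_id)
    locations: dict = field(default_factory=dict)

    def register(self, shuffle_id, map_id, host, executor_id):
        self.locations[(shuffle_id, map_id)] = (host, executor_id)

    def remove_outputs_on_executor(self, executor_id):
        # Forget every shuffle file written by this executor.
        self.locations = {
            key: loc for key, loc in self.locations.items()
            if loc[1] != executor_id
        }

    def remove_outputs_on_host(self, host):
        # Forget every shuffle file stored on this host.
        self.locations = {
            key: loc for key, loc in self.locations.items()
            if loc[0] != host
        }


def handle_executor_lost(registry, executor_id, host,
                         shuffle_service_enabled, host_lost):
    """Decide which shuffle files to invalidate after an executor loss."""
    if host_lost:
        # The whole slave is gone, so the shuffle service on it is gone too:
        # nothing on that host can be served any more.
        registry.remove_outputs_on_host(host)
    elif not shuffle_service_enabled:
        # No external service: the files die with the executor.
        registry.remove_outputs_on_executor(executor_id)
    # Otherwise the external shuffle service is still running and can keep
    # serving the lost executor's files, so nothing is invalidated.


registry = MapOutputRegistry()
registry.register(0, 0, "hostA", "exec1")
registry.register(0, 1, "hostA", "exec2")
registry.register(0, 2, "hostB", "exec3")

# Executor lost, shuffle service still up: outputs stay registered.
handle_executor_lost(registry, "exec1", "hostA",
                     shuffle_service_enabled=True, host_lost=False)

# Whole host lost: every output on hostA is invalidated at once.
handle_executor_lost(registry, "exec1", "hostA",
                     shuffle_service_enabled=True, host_lost=True)
print(sorted(registry.locations))
```

Invalidating at the moment the loss is reported, rather than waiting for a fetch to fail, is what avoids the long hang the report describes.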
[jira] [Updated] (SPARK-17370) Shuffle service files not invalidated when a slave is lost
[ https://issues.apache.org/jira/browse/SPARK-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-17370:
-------------------------------
    Assignee: Eric Liang

> Shuffle service files not invalidated when a slave is lost
> ----------------------------------------------------------
>
>                 Key: SPARK-17370
>                 URL: https://issues.apache.org/jira/browse/SPARK-17370
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Eric Liang
>            Assignee: Eric Liang
[jira] [Updated] (SPARK-17370) Shuffle service files not invalidated when a slave is lost
[ https://issues.apache.org/jira/browse/SPARK-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Liang updated SPARK-17370:
-------------------------------
    Component/s: Spark Core

> Shuffle service files not invalidated when a slave is lost
> ----------------------------------------------------------
>
>                 Key: SPARK-17370
>                 URL: https://issues.apache.org/jira/browse/SPARK-17370
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Eric Liang