[jira] [Assigned] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-07-22 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-32003:


Assignee: Wing Yew Poon

> Shuffle files for lost executor are not unregistered if fetch failure occurs 
> after executor is lost
> ---
>
> Key: SPARK-32003
> URL: https://issues.apache.org/jira/browse/SPARK-32003
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wing Yew Poon
>Assignee: Wing Yew Poon
>Priority: Major
>
> A customer's cluster has a node that goes down while a Spark application is 
> running. (They are running Spark on YARN with the external shuffle service 
> enabled.) An executor is lost (apparently the only one running on the node). 
> This executor lost event is handled in the DAGScheduler, which removes the 
> executor from its BlockManagerMaster. At this point, there is no 
> unregistering of shuffle files for the executor or the node. Soon after, 
> tasks trying to fetch shuffle files output by that executor fail with 
> FetchFailed (because the node is down, so no NodeManager is available to 
> serve its shuffle files). Such fetch failures should cause the shuffle 
> files for the executor to be unregistered, but they do not.
> Due to the task failures, the stage is re-attempted. Tasks continue to fail due 
> to fetch failures from the lost executor's shuffle output. This time, since the 
> epoch of the failing tasks is higher than the failed epoch recorded for the 
> executor, the executor is removed again (which doesn't really do anything, as 
> the executor was already removed when it was lost), and this time the shuffle 
> output is unregistered.
> So it takes two stage attempts instead of one to clear the shuffle output. We 
> get 4 stage attempts by default. The customer was unlucky: two nodes went down 
> during the stage, i.e., the same problem happened twice. They used up all 4 
> stage attempts, so the stage failed, and with it the job.
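
The two-attempts-per-lost-node behavior comes from the epoch check described above. Below is a minimal, self-contained Scala sketch of that bookkeeping as I understand it; the names (EpochGuardSketch, failedEpoch, executorLost, fetchFailed, registeredShuffleExecs) are simplified stand-ins for illustration, not the actual DAGScheduler code:

{code:scala}
// Toy model of the epoch bookkeeping described above; NOT the real DAGScheduler.
object EpochGuardSketch {
  import scala.collection.mutable

  var currentEpoch: Long = 0L
  // Epoch at which each executor was last marked failed/lost.
  val failedEpoch: mutable.Map[String, Long] = mutable.Map.empty
  // Executors whose shuffle output is still registered.
  val registeredShuffleExecs: mutable.Set[String] = mutable.Set("exec-1", "exec-2")

  // Executor-lost handling: bumps the failed epoch but keeps the shuffle output
  // registered (the gap this ticket reports).
  def executorLost(execId: String): Unit = {
    currentEpoch += 1
    failedEpoch(execId) = currentEpoch
    println(s"executor $execId lost at epoch $currentEpoch; shuffle output kept")
  }

  // Fetch-failure handling: shuffle output is only unregistered if the failing
  // task's epoch is newer than the executor's recorded failed epoch.
  def fetchFailed(execId: String, taskEpoch: Long): Unit = {
    if (!failedEpoch.contains(execId) || failedEpoch(execId) < taskEpoch) {
      failedEpoch(execId) = taskEpoch
      registeredShuffleExecs -= execId
      println(s"unregistered shuffle output for $execId at epoch $taskEpoch")
    } else {
      println(s"fetch failure from $execId at epoch $taskEpoch ignored; output kept")
    }
  }

  def main(args: Array[String]): Unit = {
    executorLost("exec-1")                          // node goes down, executor lost
    fetchFailed("exec-1", taskEpoch = 0L)           // stage attempt 1: ignored
    currentEpoch += 1                               // retried tasks get a newer epoch
    fetchFailed("exec-1", taskEpoch = currentEpoch) // stage attempt 2: unregistered
  }
}
{code}

Running the sketch, the first fetch failure (stage attempt 1) is ignored because its epoch is not newer than the executor's failed epoch, and only the retried, higher-epoch failure (stage attempt 2) unregisters the output. The 4-attempt budget mentioned above corresponds to spark.stage.maxConsecutiveAttempts, which defaults to 4.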



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32003:


Assignee: Apache Spark




[jira] [Assigned] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32003:


Assignee: (was: Apache Spark)
