[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wing Yew Poon updated SPARK-32003: ---------------------------------- Affects Version/s: 3.0.0 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --------------------------------------------------------------------------------------------------- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler > Affects Versions: 2.4.6, 3.0.0 > Reporter: Wing Yew Poon > Priority: Major > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure form the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. So they used up 4 > stage attempts and the stage failed and thus the job. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org