[ https://issues.apache.org/jira/browse/SPARK-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-14527.
-------------------------------
       Resolution: Duplicate
    Fix Version/s:     (was: 1.6.1)

> Job can't finish when restarting all NodeManagers while using the external shuffle service
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14527
>                 URL: https://issues.apache.org/jira/browse/SPARK-14527
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core, YARN
>            Reporter: Weizhong
>            Priority: Minor
>
> 1) Submit a wordcount app
> 2) Stop all NodeManagers while the ShuffleMapStage is running
> 3) After a few minutes, start all NodeManagers again
> The job now fails at the ResultStage, retries the ShuffleMapStage, fails at the ResultStage again, and keeps cycling through this loop without ever finishing.
> The cause: when all the NMs are stopped, the containers stay alive, but the executor registrations stored on the NMs (in YarnShuffleService) are lost. So even after all the NMs come back, ResultStage tasks fail when fetching shuffle data.
> {noformat}
> 16/04/06 17:02:14 WARN TaskSetManager: Lost task 2.0 in stage 1.11 (TID 220, spark-1): FetchFailed(BlockManagerId(3, 192.168.42.175, 27337), shuffleId=0, mapId=4, reduceId=2, message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Executor is not registered (appId=application_1459927459378_0005, execId=3)
> ...
> 16/04/06 17:02:14 INFO YarnScheduler: Removed TaskSet 1.11, whose tasks have all completed, from pool
> 16/04/06 17:02:14 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (map at wordcountWithSave.scala:21) and ResultStage 1 (saveAsTextFile at wordcountWithSave.scala:32) due to fetch failure
> {noformat}
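For context, the repro in steps 1-3 assumes the external shuffle service is enabled on YARN. A minimal setup along these lines would reproduce the preconditions (the application class and jar names here are illustrative, not taken from the report):

{noformat}
# yarn-site.xml on each NodeManager:
#   yarn.nodemanager.aux-services = spark_shuffle
#   yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService

# Submit the wordcount app with the external shuffle service turned on:
spark-submit \
  --master yarn \
  --conf spark.shuffle.service.enabled=true \
  --class WordCountWithSave wordcount.jar /input/ /output/
{noformat}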
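The "Executor is not registered" error above is the key symptom: the shuffle service resolves fetch requests through per-executor registrations that live only in the NM process's memory, so restarting every NM empties the registry even though the executors and their shuffle files on local disk survive. A simplified Scala sketch of that lookup path, with illustrative names rather than the actual Spark source:

{code:scala}
import scala.collection.concurrent.TrieMap

// Illustrative stand-in for the external shuffle service's executor
// registry. Registrations live only in process memory, so an NM restart
// comes back with an empty map even though the executors and their
// shuffle files on local disk are untouched.
class ShuffleRegistry {
  // (appId, execId) -> registered executor info (e.g. its local dirs)
  private val executors = TrieMap.empty[(String, String), String]

  def registerExecutor(appId: String, execId: String, localDirs: String): Unit =
    executors.put((appId, execId), localDirs)

  def openBlock(appId: String, execId: String, blockId: String): String = {
    val dirs = executors.getOrElse(
      (appId, execId),
      // The failure in the log above: the registration died with the old
      // NM process, so every fetch aimed at this executor now throws.
      throw new RuntimeException(
        s"Executor is not registered (appId=$appId, execId=$execId)"))
    s"$dirs/$blockId" // resolve the shuffle file path from the registered dirs
  }
}
{code}

Because nothing outside that in-memory registry holds the registrations, every retry of the ResultStage hits the same RuntimeException, which is why the DAGScheduler keeps resubmitting ShuffleMapStage 0 and ResultStage 1 in a loop.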