[ 
https://issues.apache.org/jira/browse/SPARK-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weizhong updated SPARK-14527:
-----------------------------
    Description: 
1) Submit a wordcount app (a minimal sketch of such an app is shown below)
2) Stop all NodeManagers while the ShuffleMapStage is running
3) After a few minutes, start all NodeManagers again

Now the job fails at the ResultStage and then retries the ShuffleMapStage; the 
ResultStage fails again, and it keeps looping like this, so the job can never 
finish.
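
For reference, here is a minimal sketch of the kind of wordcount-with-save app 
used in step 1 (the actual wordcountWithSave.scala is not attached, so the 
object name, paths and exact transformations below are assumptions). The job 
runs on YARN with the external shuffle service enabled, which is the setup in 
which the failure reproduces:
{noformat}
// Minimal sketch only; not the attached wordcountWithSave.scala.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountWithSave {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountWithSave")
      // Executors register with the NodeManager's YarnShuffleService and let
      // it serve their shuffle files instead of serving them directly.
      .set("spark.shuffle.service.enabled", "true")
    val sc = new SparkContext(conf)

    val counts = sc.textFile(args(0))        // e.g. an HDFS input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))                // ShuffleMapStage ends here
      .reduceByKey(_ + _)                    // shuffle boundary

    counts.saveAsTextFile(args(1))           // ResultStage: fetches shuffle data
    sc.stop()
  }
}
{noformat}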

This happens because when all the NMs are stopped, the containers stay alive, 
but the executor info stored on the NM (in the YarnShuffleService) is lost, so 
even after all the NMs come back, tasks in the ResultStage fail when fetching 
shuffle data.
{noformat}
16/04/06 17:02:14 WARN TaskSetManager: Lost task 2.0 in stage 1.11 (TID 220, 
spark-1): FetchFailed(BlockManagerId(3, 192.168.42.175, 27337), shuffleId=0, 
mapId=4, reduceId=2, message=
org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: 
Executor is not registered (appId=application_1459927459378_0005, execId=3)
...
16/04/06 17:02:14 INFO YarnScheduler: Removed TaskSet 1.11, whose tasks have 
all completed, from pool
16/04/06 17:02:14 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (map at 
wordcountWithSave.scala:21) and ResultStage 1 (saveAsTextFile at 
wordcountWithSave.scala:32) due to fetch failure
{noformat}
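
The "Executor is not registered" error comes from the shuffle service's 
executor registry: the NodeManager's YarnShuffleService keeps an in-memory map 
from (appId, execId) to each executor's shuffle directories, and that map is 
gone after the NM process restarts even though the shuffle files are still on 
local disk. Below is a simplified Scala sketch of that lookup (the real logic 
lives in org.apache.spark.network.shuffle.ExternalShuffleBlockResolver; the 
class and method names here are illustrative, not the actual implementation):
{noformat}
// Illustrative model only; names are assumptions, not Spark's real classes.
import scala.collection.concurrent.TrieMap

case class ExecutorShuffleInfo(localDirs: Seq[String], shuffleManager: String)

class ShuffleServiceRegistry {
  // Held only in the NodeManager process (unless NM recovery persists it), so
  // an NM restart wipes it while the shuffle files stay on disk.
  private val executors = TrieMap.empty[(String, String), ExecutorShuffleInfo]

  // Called once by each executor when it starts up.
  def registerExecutor(appId: String, execId: String,
                       info: ExecutorShuffleInfo): Unit =
    executors.put((appId, execId), info)

  // Called for every shuffle block fetch; after the NM restart the map is
  // empty, so every fetch fails with the error seen in the log above.
  def getExecutorInfo(appId: String, execId: String): ExecutorShuffleInfo =
    executors.getOrElse((appId, execId),
      throw new RuntimeException(
        s"Executor is not registered (appId=$appId, execId=$execId)"))
}
{noformat}
Because an executor registers with the shuffle service only once, at startup, 
re-running the ShuffleMapStage on the same still-alive executors does not 
repopulate this map, which would explain why the retries keep failing in a 
loop instead of recovering.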



> Job can't finish when all NodeManagers are restarted while using the external 
> shuffle service
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14527
>                 URL: https://issues.apache.org/jira/browse/SPARK-14527
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core, YARN
>            Reporter: Weizhong
>            Priority: Minor
>
> 1) Submit a wordcount app
> 2) Stop all NodeManagers while the ShuffleMapStage is running
> 3) After a few minutes, start all NodeManagers again
> Now the job fails at the ResultStage and then retries the ShuffleMapStage; the 
> ResultStage fails again, and it keeps looping like this, so the job can never 
> finish.
> This happens because when all the NMs are stopped, the containers stay alive, 
> but the executor info stored on the NM (in the YarnShuffleService) is lost, so 
> even after all the NMs come back, tasks in the ResultStage fail when fetching 
> shuffle data.
> {noformat}
> 16/04/06 17:02:14 WARN TaskSetManager: Lost task 2.0 in stage 1.11 (TID 220, 
> spark-1): FetchFailed(BlockManagerId(3, 192.168.42.175, 27337), shuffleId=0, 
> mapId=4, reduceId=2, message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: 
> Executor is not registered (appId=application_1459927459378_0005, execId=3)
> ...
> 16/04/06 17:02:14 INFO YarnScheduler: Removed TaskSet 1.11, whose tasks have 
> all completed, from pool
> 16/04/06 17:02:14 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (map at 
> wordcountWithSave.scala:21) and ResultStage 1 (saveAsTextFile at 
> wordcountWithSave.scala:32) due to fetch failure
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
