ZhengYaofeng created SPARK-13931:
------------------------------------

             Summary: Resolve stage hanging up problem in a particular case
                 Key: SPARK-13931
                 URL: https://issues.apache.org/jira/browse/SPARK-13931
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 1.6.1, 1.6.0, 1.5.2, 1.4.1
            Reporter: ZhengYaofeng


Suppose the following steps:
1. Speculative execution is enabled in the application (see the configuration sketch after this list).
2. The app runs and the last task of shuffleMapStage 1 finishes. To be precise: from the 
DAGScheduler's point of view the stage has completed, and in the TaskSetManager the flag 
'isZombie' has been set to true, but 'runningTasksSet' is not empty because speculative 
copies of some tasks are still running.
3. Suddenly, executor 3 is lost. On receiving this signal, the TaskScheduler invokes 
executorLost on every TaskSetManager in rootPool, and the DAGScheduler removes all of 
that executor's outputLocs.
4. The TaskSetManager adds all of that executor's tasks back to its pending tasks and 
tells the DAGScheduler they will be resubmitted (note: possibly not promptly).
5. The DAGScheduler starts to submit a new waiting stage, say shuffleMapStage 2, and 
finds that shuffleMapStage 1 is a missing parent because some of its outputLocs were 
removed by the executor loss. It therefore submits shuffleMapStage 1 again.
6. The DAGScheduler still receives 'Resubmitted' task events from the old TaskSetManager 
and enlarges shuffleMapStage 1's set of pending tasks each time. The old TaskSetManager, 
however, never offers those tasks for launch again because its 'isZombie' flag is true 
(see the simplified model after this list).
7. As a result, shuffleMapStage 1 never finishes from the DAGScheduler's point of view, 
and neither does any stage that depends on it.
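
For step 1, a minimal configuration sketch (the application name is a placeholder; 
'spark.speculation' and the related keys are Spark's standard configuration properties):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of step 1: turn on speculative execution.
// The app name is a placeholder; the config keys are Spark's standard ones.
val conf = new SparkConf()
  .setAppName("speculation-hang-repro")
  .set("spark.speculation", "true")            // enable speculative task launching
  .set("spark.speculation.quantile", "0.75")   // default: fraction of tasks that must finish first
  .set("spark.speculation.multiplier", "1.5")  // default: how much slower than the median counts as slow

val sc = new SparkContext(conf)
{code}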
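
The interaction in steps 4-6 can be illustrated with a rough, self-contained model. 
The class names below are invented stand-ins, not the actual Spark source; they only 
mirror the bookkeeping described above:

{code:scala}
import scala.collection.mutable

// Stand-in for a shuffle map stage as the DAGScheduler tracks it.
class ModelStage(val id: Int) {
  val pendingTasks = mutable.HashSet[Int]()   // partitions still expected to finish
}

// Stand-in for the DAGScheduler's handling of a 'Resubmitted' task event (step 6).
class ModelDAGScheduler {
  def taskResubmitted(stage: ModelStage, partition: Int): Unit =
    stage.pendingTasks += partition           // the stage cannot finish until this drains
}

// Stand-in for the old (zombie) TaskSetManager.
class ModelTaskSetManager(stage: ModelStage, dag: ModelDAGScheduler) {
  var isZombie = false                        // step 2: true once the last task has finished
  private val toRelaunch = mutable.Queue[Int]()

  // Step 4: the lost executor's tasks are queued and reported as Resubmitted.
  def executorLost(lostPartitions: Seq[Int]): Unit =
    lostPartitions.foreach { p =>
      toRelaunch.enqueue(p)
      dag.taskResubmitted(stage, p)
    }

  // Step 6: a zombie manager never offers tasks again, so the queue is never drained.
  def resourceOffer(): Option[Int] =
    if (isZombie || toRelaunch.isEmpty) None else Some(toRelaunch.dequeue())
}

// Putting it together: the pending set grows but can never be emptied (step 7).
val stage = new ModelStage(1)
val dag   = new ModelDAGScheduler
val tsm   = new ModelTaskSetManager(stage, dag)
tsm.isZombie = true                           // last task already finished
tsm.executorLost(Seq(0, 3))                   // executor 3 held these map outputs
assert(stage.pendingTasks.nonEmpty)           // DAGScheduler is still waiting
assert(tsm.resourceOffer().isEmpty)           // but no task will ever be launched
{code}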


