[ 
https://issues.apache.org/jira/browse/SPARK-13931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-13931.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

> Resolve stage hanging up problem in a particular case
> -----------------------------------------------------
>
>                 Key: SPARK-13931
>                 URL: https://issues.apache.org/jira/browse/SPARK-13931
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.4.1, 1.5.2, 1.6.0, 1.6.1
>            Reporter: ZhengYaofeng
>             Fix For: 2.2.0
>
>
> Suppose the following steps:
> 1. Open speculation switch in the application. 
> 2. Run this app and suppose last task of shuffleMapStage 1 finishes. Let's 
> get the record straight, from the eyes of DAG, this stage really finishes, 
> and from the eyes of TaskSetManager, variable 'isZombie' is set to true, but 
> variable runningTasksSet isn't empty because of speculation.
> 3. Suddenly, executor 3 is lost. TaskScheduler receiving this signal, invokes 
> all executorLost functions of rootPool's taskSetManagers. DAG receiving this 
> signal, removes all this executor's outputLocs.
> 4. TaskSetManager adds all this executor's tasks to pendingTasks and tells 
> DAG they will be resubmitted (Attention: possibly not on time).
> 5. DAG starts to submit a new waitingStage, let's say shuffleMapStage 2, and 
> going to find that shuffleMapStage 1 is its missing parent because some 
> outputLocs are removed due to executor lost. Then DAG submits shuffleMapStage 
> 1 again.
> 6. DAG still receives Task 'Resubmitted' signal from old taskSetManager, and 
> increases the number of pendingTasks of shuffleMapStage 1 each time. However, 
> old taskSetManager won't resolve new task to submit because its variable 
> 'isZombie' is set to true.
> 7. Finally shuffleMapStage 1 never finishes in DAG together with all stages 
> depending on it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to