GitHub user GavinGavinNo1 reopened a pull request:

    https://github.com/apache/spark/pull/16855

    [SPARK-13931] Resolve stage hanging up problem in a particular case

    ## What changes were proposed in this pull request?
    When 'executorLost' is invoked in class 'TaskSetManager', it is important 
to check whether variable 'isZombie' is set to true.
    
    This pull request fixes the following hang:
    
    1. Enable speculation in the application (spark.speculation).
    2. Run the application and suppose the last task of shuffleMapStage 1 
finishes. To be clear: from the DAGScheduler's point of view this stage has 
really finished, and from the TaskSetManager's point of view variable 
'isZombie' is set to true, but variable runningTasksSet isn't empty because of 
speculation.
    3. Suddenly, executor 3 is lost. On receiving this signal, the 
TaskScheduler invokes executorLost on every TaskSetManager in rootPool, and the 
DAGScheduler removes all of this executor's outputLocs.
    4. The TaskSetManager adds all of this executor's tasks to pendingTasks and 
tells the DAGScheduler that they will be resubmitted (note: possibly not in 
time).
    5. The DAGScheduler starts to submit a new waiting stage, say 
shuffleMapStage 2, and finds that shuffleMapStage 1 is a missing parent because 
some outputLocs were removed when the executor was lost. The DAGScheduler then 
submits shuffleMapStage 1 again.
    6. The DAGScheduler still receives task 'Resubmitted' signals from the old 
TaskSetManager and increases the number of pendingTasks of shuffleMapStage 1 
each time. However, the old TaskSetManager will never launch new tasks for them 
because its variable 'isZombie' is set to true.
    7. As a result, shuffleMapStage 1 never finishes in the DAGScheduler, and 
neither do the stages that depend on it.
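
    The bookkeeping problem in steps 4 to 7 can be illustrated with a small 
toy model. This is only a sketch under stated assumptions, not the code in this 
patch: ToyDag, ToyTaskSetManager, finishedOn, checkZombie and simulate are 
hypothetical names, and the guard shown is one plausible reading of "check 
whether 'isZombie' is set to true" inside executorLost.

```scala
import scala.collection.mutable

// Toy model only: none of these classes are Spark's real classes.
object ZombieGuardSketch {

  // Hypothetical stand-in for the DAGScheduler's per-stage pending counter.
  final class ToyDag {
    var pendingTasks: Int = 0
    def taskResubmitted(): Unit = pendingTasks += 1
  }

  // Hypothetical stand-in for the TaskSetManager of one shuffle map stage.
  final class ToyTaskSetManager(dag: ToyDag, checkZombie: Boolean) {
    var isZombie: Boolean = false
    // task index -> executor holding that task's finished map output
    val finishedOn: mutable.Map[Int, String] = mutable.Map.empty

    def executorLost(execId: String): Unit = {
      // Assumed guard: a zombie manager never launches tasks again,
      // so it should not announce resubmissions it cannot fulfil.
      if (!(checkZombie && isZombie)) {
        val lost = finishedOn.collect {
          case (task, exec) if exec == execId => task
        }.toList
        lost.foreach { task =>
          finishedOn.remove(task)
          dag.taskResubmitted() // corresponds to the 'Resubmitted' signal
        }
      }
    }
  }

  // Returns the number of pending tasks the toy DAG is left waiting on.
  def simulate(checkZombie: Boolean): Int = {
    val dag = new ToyDag
    val tsm = new ToyTaskSetManager(dag, checkZombie)
    tsm.finishedOn ++= Seq(0 -> "executor-3", 1 -> "executor-3", 2 -> "executor-1")
    tsm.isZombie = true // every task succeeded; speculative copies may still run
    tsm.executorLost("executor-3")
    dag.pendingTasks
  }

  def main(args: Array[String]): Unit = {
    println(s"without guard: ${simulate(checkZombie = false)} stale pending tasks")
    println(s"with guard: ${simulate(checkZombie = true)} stale pending tasks")
  }
}
```

    In this toy model, simulate(checkZombie = false) leaves two pending tasks 
that the zombie manager will never run, which mirrors the hang described above, 
while simulate(checkZombie = true) leaves none.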
    
    ## How was this patch tested?
    
    It's quite difficult to construct test cases.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/GavinGavinNo1/spark resolve-stage-blocked2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16855.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16855
    
----
commit e15b2abedb6fcaf6bac8775f15bdd246fa22902e
Author: GavinGavinNo1 <gavingavin...@gmail.com>
Date:   2017-02-08T14:51:59Z

    Resolve stage hanging up problem in a particular case

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
