[GitHub] spark pull request #21085: [SPARK-23948] Trigger mapstage's job listener in ...

squito Tue, 17 Apr 2018 07:03:58 -0700

GitHub user squito opened a pull request:

    https://github.com/apache/spark/pull/21085


    [SPARK-23948] Trigger mapstage's job listener in submitMissingTasks

    ## What changes were proposed in this pull request?
    
    When SparkContext submits a map stage to `DAGScheduler` through `submitMapStage`,
    `markMapStageJobAsFinished` is called in only two places
    (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
    and
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314).
    
    But consider the following scenario:
    1. stage0 and stage1 are both `ShuffleMapStage`s, and stage1 depends on stage0;
    2. We submit stage1 via `submitMapStage`;
    3. While stage1 is running, a `FetchFailed` occurs, and stage0 and stage1 are
       resubmitted as stage0_1 and stage1_1;
    4. While stage0_1 is running, speculated tasks from the old stage1 report
       success, but stage1 is not in `runningStages`. So even though all splits
       (including the speculated tasks) in stage1 have succeeded, the job listener
       for stage1 is not called;
    5. stage0_1 finishes and stage1_1 starts running. In `submitMissingTasks`
       there are no missing tasks, but in the current code the job listener is
       not triggered.
    
    We should call the job listener for the map stage in step `5`.
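
The scenario above can be replayed with a small toy model. This is only a sketch under stated assumptions, not Spark's `DAGScheduler`: `ToyMapStage`, `ToyScheduler`, `replayScenario`, and the `notifyWhenNothingMissing` flag are hypothetical names invented for illustration, and the model assumes that task completions always register their output but notify the listener only for stages in `runningStages`.

```scala
import scala.collection.mutable

// Toy model only: every name here is hypothetical and NOT part of Spark's API.
object ListenerSketch {

  final class ToyMapStage(val id: Int, val numPartitions: Int) {
    val finishedPartitions = mutable.Set.empty[Int]
    var listenerCalled = false
    def isAvailable: Boolean = finishedPartitions.size == numPartitions
  }

  final class ToyScheduler(notifyWhenNothingMissing: Boolean) {
    val runningStages = mutable.Set.empty[ToyMapStage]

    // Assumption: output is always recorded, but the listener fires
    // only if the stage is currently in runningStages.
    def handleTaskCompletion(stage: ToyMapStage, partition: Int): Unit = {
      stage.finishedPartitions += partition
      if (runningStages.contains(stage) && stage.isAvailable) {
        runningStages -= stage
        stage.listenerCalled = true
      }
    }

    def submitMissingTasks(stage: ToyMapStage): Unit = {
      val missing =
        (0 until stage.numPartitions).filterNot(stage.finishedPartitions.contains)
      if (missing.nonEmpty) {
        runningStages += stage
      } else if (notifyWhenNothingMissing) {
        // The behaviour proposed in step 5: nothing left to run,
        // so tell the map stage job's listener right away.
        stage.listenerCalled = true
      }
    }
  }

  // Replays steps 2 to 5 and reports whether the listener was ever called.
  def replayScenario(notifyWhenNothingMissing: Boolean): Boolean = {
    val scheduler = new ToyScheduler(notifyWhenNothingMissing)
    val stage1 = new ToyMapStage(id = 1, numPartitions = 2)

    scheduler.submitMissingTasks(stage1)      // step 2: stage1 is submitted
    scheduler.handleTaskCompletion(stage1, 0) // one partition finishes normally
    scheduler.runningStages -= stage1         // step 3: FetchFailed, stage1 stops running
    scheduler.handleTaskCompletion(stage1, 1) // step 4: late speculated task succeeds
    scheduler.submitMissingTasks(stage1)      // step 5: resubmitted, nothing is missing
    stage1.listenerCalled
  }

  def main(args: Array[String]): Unit = {
    println(s"listener called without the change: ${replayScenario(false)}")
    println(s"listener called with the change:    ${replayScenario(true)}")
  }
}
```

In this model the listener is never notified unless `submitMissingTasks` handles the "no missing tasks" case itself, which is the gap described in step `5`.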
    
    ## How was this patch tested?
    
    Not added yet.
    
    Author: jinxing <jinxing6...@126.com>
    
    Closes #21019 from jinxing64/SPARK-23948.
    
    (cherry picked from commit 3990daaf3b6ca2c5a9f7790030096262efb12cb2)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark cp

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21085.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21085
    
----
commit 35e349f402ffd83a4eae31ffb848cd400595d9f7
Author: jinxing <jinxing6042@...>
Date:   2018-04-17T13:55:01Z

    [SPARK-23948] Trigger mapstage's job listener in submitMissingTasks
    

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
