GitHub user Ngone51 opened a pull request:

    https://github.com/apache/spark/pull/21096

    cache rdd's immediate parent ShuffleDependencies to accelerate 
getShuffleDependencies

    ## What changes were proposed in this pull request?
    
    When creating stages for a job, the scheduler calls
    `getShuffleDependencies()` at least twice for every rdd except the final
    one to find its immediate parent `ShuffleDependencies`: first in
    `getMissingAncestorShuffleDependencies()`, and again in
    `getOrCreateParentStages()`.
    
    So we can cache the result the first time `getShuffleDependencies()` is
    called for an rdd.
    This saves time when there are many `NarrowDependencies` between the rdd
    and its immediate parent `ShuffleDependencies`, or when the rdd has many
    immediate parent `ShuffleDependencies`.
     
    Checkpointed rdds are an exception: once an rdd is checkpointed, its
    immediate parent `ShuffleDependencies` should be treated as empty.
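
The caching idea described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the patch: `Node`, `Dep`, `Narrow`, `Shuffle`, `ShuffleDepCache` and the `traversals` counter are hypothetical stand-ins for Spark's RDD and dependency classes, and keying the cache by rdd id is an assumption made for the sketch.

```scala
import scala.collection.mutable

// Hypothetical toy model of an RDD lineage graph (not Spark's real classes).
sealed trait Dep { def parent: Node }
final case class Narrow(parent: Node) extends Dep
final case class Shuffle(parent: Node) extends Dep

final class Node(val id: Int, val deps: Seq[Dep], var checkpointed: Boolean = false)

final class ShuffleDepCache {
  // Assumption: results are cached per rdd id.
  private val cache = mutable.HashMap.empty[Int, Set[Shuffle]]

  // Counts full graph walks, only so the effect of the cache is observable.
  var traversals = 0

  def getShuffleDependencies(rdd: Node): Set[Shuffle] = {
    // A checkpointed rdd has its lineage cut, so it has no parent shuffles.
    if (rdd.checkpointed) Set.empty
    else cache.getOrElseUpdate(rdd.id, traverse(rdd))
  }

  // Walks back through narrow dependencies only and stops at the first
  // shuffle dependency found on each path.
  private def traverse(rdd: Node): Set[Shuffle] = {
    traversals += 1
    val found = mutable.HashSet.empty[Shuffle]
    val visited = mutable.HashSet.empty[Int]
    var waiting = List(rdd)
    while (waiting.nonEmpty) {
      val current = waiting.head
      waiting = waiting.tail
      if (visited.add(current.id)) {
        current.deps.foreach {
          case s: Shuffle => found += s
          case n: Narrow  => waiting = n.parent :: waiting
        }
      }
    }
    found.toSet
  }
}
```

In this sketch the checkpoint check runs before the cache lookup, so an rdd that is checkpointed after its result was cached still reports no parent shuffles.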
    
    ## How was this patch tested?
    
    Existing tests.
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ngone51/spark SPARK-24011

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21096.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21096
    
----
commit 59fb931135b7bc8fc1f516c39015f7412ae25208
Author: wuyi <ngone_5451@...>
Date:   2018-04-18T10:35:22Z

    cache rdd's immediate ShuffleDependencies

----

