GitHub user Ngone51 opened a pull request: https://github.com/apache/spark/pull/21096

    cache rdd's immediate parent ShuffleDependencies to accelerate getShuffleDependencies

    ## What changes were proposed in this pull request?

    When creating stages for a job, we need to find the immediate parent `ShuffleDependencies` of each RDD (except the final RDD) by calling `getShuffleDependencies()` at least twice: first in `getMissingAncestorShuffleDependencies()`, and again in `getOrCreateParentStages()`. We can therefore cache the result the first time `getShuffleDependencies()` is called for an RDD. This saves time when there are many `NarrowDependencies` between the RDD and its immediate parent `ShuffleDependencies`, or when the RDD has a large number of immediate parent `ShuffleDependencies`.

    Checkpointed RDDs are an exception. If an RDD is checkpointed, its immediate parent `ShuffleDependencies` should be adjusted to empty.

    ## How was this patch tested?

    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ngone51/spark SPARK-24011

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21096.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21096

----
commit 59fb931135b7bc8fc1f516c39015f7412ae25208
Author: wuyi <ngone_5451@...>
Date:   2018-04-18T10:35:22Z

    cache rdd's immediate ShuffleDependencies

----
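
The caching idea described in the pull request can be illustrated with a small toy model. This is only a sketch in Python, not the Scala code from the patch: the `Rdd`, `Dep`, and `ShuffleDepCache` classes, the `checkpointed` flag, and the `traversals` counter are all hypothetical names introduced here for illustration. It assumes that finding immediate shuffle parents means walking back through narrow dependencies until a shuffle dependency is reached, and that a checkpointed RDD reports no shuffle parents.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass(eq=False)  # eq=False keeps identity-based hashing
class Rdd:
    """Hypothetical stand-in for an RDD: a name plus direct dependencies."""
    name: str
    deps: List["Dep"] = field(default_factory=list)
    checkpointed: bool = False


@dataclass(eq=False)
class Dep:
    """Hypothetical dependency edge; is_shuffle marks a shuffle boundary."""
    parent: Rdd
    is_shuffle: bool


class ShuffleDepCache:
    """Caches each RDD's immediate parent shuffle dependencies."""

    def __init__(self) -> None:
        self._cache: Dict[Rdd, FrozenSet[Dep]] = {}
        self.traversals = 0  # counts real graph walks, for demonstration

    def get_shuffle_dependencies(self, rdd: Rdd) -> FrozenSet[Dep]:
        # Assumption: a checkpointed RDD has no shuffle parents, so any
        # previously cached entry for it is stale and is dropped.
        if rdd.checkpointed:
            self._cache.pop(rdd, None)
            return frozenset()
        if rdd in self._cache:
            return self._cache[rdd]

        self.traversals += 1
        parents = set()
        visited = set()
        stack = [rdd]
        while stack:
            current = stack.pop()
            # Lineage is treated as truncated at checkpointed RDDs.
            if current in visited or current.checkpointed:
                continue
            visited.add(current)
            for dep in current.deps:
                if dep.is_shuffle:
                    parents.add(dep)  # stop at the shuffle boundary
                else:
                    stack.append(dep.parent)  # keep walking narrow deps

        result = frozenset(parents)
        self._cache[rdd] = result
        return result


# Usage: a chain source <-shuffle- a <-narrow- b <-narrow- c
source = Rdd("source")
shuffle_dep = Dep(source, is_shuffle=True)
a = Rdd("a", [shuffle_dep])
b = Rdd("b", [Dep(a, is_shuffle=False)])
c = Rdd("c", [Dep(b, is_shuffle=False)])

cache = ShuffleDepCache()
first = cache.get_shuffle_dependencies(c)   # walks the narrow chain
second = cache.get_shuffle_dependencies(c)  # served from the cache
```

In this sketch the second lookup does not repeat the walk over the narrow dependencies, which is the saving the pull request aims for. Note that the sketch does not invalidate cached entries of an RDD's descendants when that RDD is later checkpointed; how the actual patch handles that case is not stated in this message.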