wuyi created SPARK-24011:
----------------------------

             Summary: Cache rdd's immediate parent ShuffleDependency to accelerate getShuffleDependencies()
                 Key: SPARK-24011
                 URL: https://issues.apache.org/jira/browse/SPARK-24011
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.3.0
            Reporter: wuyi

When creating stages for a job, we need to find the immediate parent ShuffleDependencies of each RDD (except the final RDD) by calling getShuffleDependencies() at least twice: first in getMissingAncestorShuffleDependencies(), and again in getOrCreateParentStages(). So we can cache the result the first time we call getShuffleDependencies() for an RDD.

This reduces the time spent when there are many NarrowDependencies between the RDD and its immediate parent ShuffleDependencies, or when the RDD has a large number of immediate parent ShuffleDependencies.

Checkpointed RDDs are an exception. If an RDD is checkpointed, its cached immediate parent ShuffleDependencies should be adjusted to empty.
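
The caching idea described in this issue can be sketched roughly as follows. This is a minimal, self-contained model and not the actual DAGScheduler code or a proposed patch: Node, Dep, NarrowDep, ShuffleDep, ShuffleDepFinder and the traversals counter are hypothetical stand-ins for Spark's RDD and Dependency types, introduced only for illustration. It also models only the checkpoint case named above, where the queried RDD itself is checkpointed.

```scala
import scala.collection.mutable

object ShuffleDepCacheSketch {
  // Hypothetical stand-ins for Spark's Dependency hierarchy.
  sealed trait Dep { def parent: Node }
  final case class NarrowDep(parent: Node) extends Dep
  final case class ShuffleDep(parent: Node, shuffleId: Int) extends Dep

  // Hypothetical stand-in for an RDD: a name, its dependencies, a checkpoint flag.
  final class Node(val name: String, val deps: Seq[Dep]) {
    var checkpointed: Boolean = false
  }

  final class ShuffleDepFinder {
    // Cached immediate parent shuffle dependencies, keyed by node identity.
    private val cache = mutable.Map.empty[Node, Set[ShuffleDep]]

    // Counts full graph walks, only so the effect of the cache is observable.
    var traversals: Int = 0

    def getShuffleDependencies(rdd: Node): Set[ShuffleDep] = {
      if (rdd.checkpointed) {
        // Assumption: a checkpointed node has its lineage cut, so it has no
        // parent shuffle dependencies and any cached entry is dropped.
        cache.remove(rdd)
        Set.empty
      } else {
        // Walk the graph only on the first call; later calls hit the cache.
        cache.getOrElseUpdate(rdd, traverse(rdd))
      }
    }

    // Depth-first walk through narrow dependencies that stops at the first
    // shuffle dependency found on each path.
    private def traverse(rdd: Node): Set[ShuffleDep] = {
      traversals += 1
      val parents = mutable.Set.empty[ShuffleDep]
      val visited = mutable.Set.empty[Node]
      var waiting: List[Node] = List(rdd)
      while (waiting.nonEmpty) {
        val toVisit = waiting.head
        waiting = waiting.tail
        if (visited.add(toVisit)) {
          toVisit.deps.foreach {
            case s: ShuffleDep => parents += s
            case NarrowDep(p)  => waiting = p :: waiting
          }
        }
      }
      parents.toSet
    }
  }
}
```

In this model, calling getShuffleDependencies() twice on the same node performs a single traversal, which mirrors the two call sites mentioned above. How the cache would be kept consistent in the real scheduler (for example when an ancestor is checkpointed after a descendant's result was cached) is not covered by this sketch.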