Min Shen created SPARK-33574:
--------------------------------

Summary: Improve locality for push-based shuffle especially for join like operations
Key: SPARK-33574
URL: https://issues.apache.org/jira/browse/SPARK-33574
Project: Spark
Issue Type: Sub-task
Components: Shuffle, Spark Core
Affects Versions: 3.1.0
Reporter: Min Shen

Currently, with push-based shuffle we only set locality for ShuffledRDD and ShuffledRowRDD. In simple stage DAGs, where a ShuffledRDD or ShuffledRowRDD is the only input RDD, Spark handles locality fine. However, for a join operation, where a stage can consume multiple shuffle inputs or other non-shuffle inputs, locality takes a hit because of how DAGScheduler currently determines the preferred location.

With push-based shuffle, we could potentially reuse the same set of merger locations across sibling ShuffleMapStages. This would enable much better locality on the reducer stage side, because the corresponding merged shuffle partitions for the multiple shuffle inputs would already be colocated.
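
To illustrate the idea, here is a minimal toy model in Python. It is not Spark code and does not reflect how DAGScheduler is implemented. The names `merger_for` and `best_local_fraction`, the round-robin assignment of reduce partitions to mergers, and the host names and sizes are all assumptions made for this sketch. It estimates, for one reduce partition of a join, what fraction of the merged input bytes can be local on the single best host.

```python
from collections import defaultdict


def merger_for(partition, mergers):
    # Assumption for this sketch: reduce partition p is merged on
    # mergers[p % len(mergers)]. Real merger assignment may differ.
    return mergers[partition % len(mergers)]


def best_local_fraction(partition, shuffles):
    """Return (host, fraction) for the host holding the most merged bytes.

    shuffles: one (mergers, size_in_bytes) pair per shuffle input of the
    reducer stage, where size_in_bytes is the merged size of `partition`.
    """
    bytes_by_host = defaultdict(int)
    total = 0
    for mergers, size in shuffles:
        bytes_by_host[merger_for(partition, mergers)] += size
        total += size
    # Sort host names first so that ties are broken deterministically.
    host = max(sorted(bytes_by_host), key=bytes_by_host.get)
    return host, bytes_by_host[host] / total


# Two sibling ShuffleMapStages feeding a join, each with its own merger list.
independent = [(["h1", "h2", "h3"], 100), (["h3", "h1", "h2"], 100)]
# The same two stages reusing one merger list.
shared = [(["h1", "h2", "h3"], 100), (["h1", "h2", "h3"], 100)]

print(best_local_fraction(0, independent))
print(best_local_fraction(0, shared))
```

In this toy setup, partition 0 of the two inputs lands on different hosts when the merger lists differ, so at most half of the input is local wherever the reducer task runs. When both stages reuse the same merger list, both merged partitions sit on the same host and the whole input can be read locally.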