Min Shen created SPARK-33574:
--------------------------------

             Summary: Improve locality for push-based shuffle, especially for join-like operations
                 Key: SPARK-33574
                 URL: https://issues.apache.org/jira/browse/SPARK-33574
             Project: Spark
          Issue Type: Sub-task
          Components: Shuffle, Spark Core
    Affects Versions: 3.1.0
            Reporter: Min Shen


Currently, we only set locality for ShuffledRDD and ShuffledRowRDD with 
push-based shuffle.

In simple stage DAGs where a ShuffledRDD or ShuffledRowRDD is the only input 
RDD, Spark handles locality well. However, for a join-like operation where a 
stage consumes multiple shuffle inputs, or a mix of shuffle and non-shuffle 
inputs, locality suffers because of how DAGScheduler currently determines 
preferred locations.
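
To illustrate the concern, here is a toy Python model rather than Spark code. 
It assumes the preferred-location lookup walks a stage's narrow dependencies 
and returns the first parent that reports any preference; the dict layout, 
the host names, and the preferred_locs helper are all hypothetical.

```python
# Toy model, not Spark code: an "RDD" is a dict holding its own preferred
# hosts and a list of narrow-dependency parents. All names are hypothetical.

def preferred_locs(rdd):
    """Return the RDD's own preferences, else the first non-empty parent's."""
    if rdd["prefs"]:
        return rdd["prefs"]
    for parent in rdd["parents"]:
        locs = preferred_locs(parent)
        if locs:
            # Later parents are never consulted once one parent answers.
            return locs
    return []

# Two shuffle inputs whose merged partitions live on different hosts.
left_shuffle = {"prefs": ["hostA"], "parents": []}
right_shuffle = {"prefs": ["hostC"], "parents": []}

# A join-like RDD with no preference of its own and two narrow parents.
join = {"prefs": [], "parents": [left_shuffle, right_shuffle]}

join_locs = preferred_locs(join)
```

Under that assumption the join task is placed by its first input only, so 
the second input's merged data is likely to be read remotely.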

With push-based shuffle, we could potentially reuse the same set of merger 
locations across sibling ShuffleMapStages. This would give much better 
locality on the reducer stage side, because the corresponding merged shuffle 
partitions for the multiple shuffle inputs would already be colocated.
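
A rough sketch of the idea, again as a toy Python model and not Spark's 
actual merger selection: the round-robin assign_mergers helper, the host 
names, and the partition count are assumptions made for illustration only.

```python
# Toy model only: the modulo assignment and host names are hypothetical and
# do not reflect how Spark actually maps reduce partitions to mergers.

def assign_mergers(mergers, num_partitions):
    """Map each reduce partition id to a merger host, round-robin."""
    return {p: mergers[p % len(mergers)] for p in range(num_partitions)}

def colocated_partitions(left, right):
    """Reduce partitions whose merged data for both inputs is on one host."""
    return [p for p in sorted(left) if left[p] == right[p]]

num_partitions = 4

# Case 1: sibling map stages pick their merger lists independently.
left_independent = assign_mergers(["hostA", "hostB"], num_partitions)
right_independent = assign_mergers(["hostC", "hostD"], num_partitions)

# Case 2: sibling map stages reuse the same merger list.
shared_mergers = ["hostA", "hostB"]
left_shared = assign_mergers(shared_mergers, num_partitions)
right_shared = assign_mergers(shared_mergers, num_partitions)

independent_hits = colocated_partitions(left_independent, right_independent)
shared_hits = colocated_partitions(left_shared, right_shared)
```

In this model, independent_hits is empty while shared_hits covers every 
reduce partition, so a single preferred host per reducer task would serve 
both sides of the join.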




