[ 
https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46512:
-------------------------------------------

    Assignee: Chenyu Zheng

> Optimize shuffle reading when both sort and combine are used.
> -------------------------------------------------------------
>
>                 Key: SPARK-46512
>                 URL: https://issues.apache.org/jira/browse/SPARK-46512
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Minor
>              Labels: pull-request-available
>
> After the shuffle reader fetches the blocks, it first performs the combine 
> operation and then the sort operation. Both combine and sort may spill 
> temporary files to disk, so performance may be poor when both are used. 
> Combining can instead be performed during the sort, which avoids the 
> combine spill files.
>  
> I did not find a direct API for constructing a shuffle that uses both sort 
> and combine, but it can be done with code like the following. It is a word 
> count whose output words are sorted.
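> {code:scala}
> import org.apache.spark.rdd.ShuffledRDD
>
> // reduceByKey supplies the combiner; setKeyOrdering adds the sort.
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1))
>   .reduceByKey(_ + _, 1)
>   .asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String)
>   .collect().foreach(println)
> {code}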
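
The difference between the two read paths described in the issue can be sketched with plain Scala collections. This is only an illustration of the idea, not the code from the pull request: the object and method names are hypothetical, everything stays in memory, and spilling is not modelled. The first method combines into a hash map and then sorts the result; the second sorts the raw records once and merges values for adjacent equal keys while streaming the sorted output, so no separate combine pass (and, in a real reader, no separate combine spill) is needed.

```scala
// Hypothetical, Spark-free sketch; names are illustrative assumptions.
object SortAndCombineSketch {

  /** Combine first (hash aggregation), then sort the combined output. */
  def combineThenSort[K, V](records: Iterator[(K, V)])(merge: (V, V) => V)(
      implicit ord: Ordering[K]): Seq[(K, V)] = {
    val combined = scala.collection.mutable.HashMap.empty[K, V]
    records.foreach { case (k, v) =>
      // Merge into the existing value for k, or store v if k is new.
      combined.update(k, combined.get(k).map(merge(_, v)).getOrElse(v))
    }
    combined.toSeq.sortBy(_._1)
  }

  /** Sort once, then merge adjacent equal keys while iterating. */
  def combineWhileSorting[K, V](records: Iterator[(K, V)])(merge: (V, V) => V)(
      implicit ord: Ordering[K]): Iterator[(K, V)] = {
    val sorted = records.toSeq.sortBy(_._1).iterator.buffered
    new Iterator[(K, V)] {
      def hasNext: Boolean = sorted.hasNext
      def next(): (K, V) = {
        val (key, first) = sorted.next()
        var acc = first
        // Equal keys are adjacent after sorting, so one pass is enough.
        while (sorted.hasNext && ord.equiv(sorted.head._1, key)) {
          acc = merge(acc, sorted.next()._2)
        }
        (key, acc)
      }
    }
  }
}
```

Both methods return the same key-ordered, combined records for the same input; only the second one avoids materialising a separate combined map before sorting.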



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
