andygrove commented on issue #1436:
URL: https://github.com/apache/datafusion-comet/issues/1436#issuecomment-2675809473

   One issue is that Comet native shuffle creates a new spill file on each call to spill, often resulting in more than 10,000 files. Each file contains a partial batch per output partition.
   
   A more efficient approach would be to keep one spill file per output partition, stream batches to those files, and then concatenate them into a single output file at the end. This is the approach taken by Spark in `BypassMergeSortShuffleWriter`:
   
   ```java
   /**
    * This class implements sort-based shuffle's hash-style shuffle fallback path. This write path
    * writes incoming records to separate files, one file per reduce partition, then concatenates
    * these per-partition files to form a single output file, regions of which are served to
    * reducers. Records are not buffered in memory. It writes output in a format that can be
    * served / consumed via {@link org.apache.spark.shuffle.IndexShuffleBlockResolver}.
    * ...
    */
   ```
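
   A minimal sketch of this shape, assuming the writer lives on the native (Rust) side and that batches are already serialized to bytes before spilling. `PartitionedSpillWriter`, `spill_batch`, and `concatenate` are hypothetical names for illustration, not existing Comet or DataFusion APIs:

   ```rust
   use std::fs::File;
   use std::io::{self, BufWriter, Write};
   use std::path::PathBuf;

   /// One long-lived spill file per output partition. Each spill appends to
   /// the partition's file; at the end the files are concatenated into a
   /// single shuffle data file. (Hypothetical type, not a Comet API.)
   struct PartitionedSpillWriter {
       paths: Vec<PathBuf>,
       writers: Vec<BufWriter<File>>,
   }

   impl PartitionedSpillWriter {
       fn try_new(spill_dir: &str, num_partitions: usize) -> io::Result<Self> {
           let paths: Vec<PathBuf> = (0..num_partitions)
               .map(|p| PathBuf::from(format!("{spill_dir}/partition-{p}.spill")))
               .collect();
           let writers = paths
               .iter()
               .map(|p| File::create(p).map(BufWriter::new))
               .collect::<io::Result<Vec<_>>>()?;
           Ok(Self { paths, writers })
       }

       /// Called on each spill: append the encoded batch to the existing
       /// per-partition file instead of creating a new spill file.
       fn spill_batch(&mut self, partition: usize, encoded_batch: &[u8]) -> io::Result<()> {
           self.writers[partition].write_all(encoded_batch)
       }

       /// Concatenate the per-partition files into one output file and return
       /// the byte offset where each partition's region starts, i.e. the same
       /// information Spark records in its shuffle index file.
       fn concatenate(mut self, output_path: &str) -> io::Result<Vec<u64>> {
           for w in &mut self.writers {
               w.flush()?;
           }
           let mut output = File::create(output_path)?;
           let mut offsets = vec![0u64];
           let mut offset = 0u64;
           for path in &self.paths {
               let mut input = File::open(path)?;
               offset += io::copy(&mut input, &mut output)?;
               offsets.push(offset);
           }
           Ok(offsets)
       }
   }
   ```

   With this layout each task writes at most `num_partitions` spill files no matter how many times it spills, and the final concatenation is purely sequential I/O.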

