[I] Optimized spill file format [datafusion]

via GitHub Fri, 10 Jan 2025 06:24:16 -0800


alamb opened a new issue, #14078:
URL: https://github.com/apache/datafusion/issues/14078


   ### Is your feature request related to a problem or challenge?
   
   DataFusion spills data to local disk for processing datasets that do not fit 
in available memory, as illustrated in this comment:
   
https://github.com/apache/datafusion/blob/918ac1b30be026e1276d638a0dbc10002627e9d7/datafusion/physical-plan/src/sorts/sort.rs#L88-L205
   
   Here is the [code that handles spilling in sort and hash 
aggregates](https://github.com/apache/datafusion/blob/918ac1b30be026e1276d638a0dbc10002627e9d7/datafusion/physical-plan/src/spill.rs#L93-L110).
   
   The current version of DataFusion spills data to disk using the Arrow IPC 
format, which is correct and was easy to get working, but comes with non 
trivial overhead, as @andygrove  found in Comet:
   - https://github.com/apache/datafusion-comet/issues/1189
   
   Some potential sources of overhead are due to the validation applied to 
arrow IPC files (which may in general have come from untrusted sources) which 
is unecessary if the data was valid when written by DataFusion
   
   ### Describe the solution you'd like
   
   As part of improving DataFuison's performance of larger than memory 
datasets, I would like to consider adding a new optimized serialization format 
for use in spill files. This is similar to how we have added optimized (non 
Arrow) in memory storage formats for intermediate results in hash aggregation 
and others
   
   At a high level this would look like:
   - [ ] Add a benchmark for spilling files (maybe write/read data to/from a 
spill file of various sizes)
   - [ ] Add an optimized Reader/Writer
   - [ ] Update the SortExec and GroupByHashExec to use this new Reader/Writer
   
   ### Describe alternatives you've considered
   
   # Add a customized Reader/Writer
   @andygrove has a PR to add a customized BatchReader to comet that seems to 
offer significant performance improvements for shuffle reading/writing: 
   - https://github.com/apache/datafusion-comet/pull/1190
   
   We could potentially upstream this code into DataFusion
   
   
   # Optimize the Arrow IPC reader 
   Another option  would be to continue using the Arrow IPC format, but disable 
validation:
   - https://github.com/apache/arrow-rs/issues/3287
   -  https://github.com/apache/arrow-rs/issues/6933
   
   
   @totoroyyb actually has a recently PR to arrow-rs proposing this approach:
   - https://github.com/apache/arrow-rs/pull/6938
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Optimized spill file format [datafusion]

Reply via email to