alamb opened a new issue, #14078: URL: https://github.com/apache/datafusion/issues/14078
### Is your feature request related to a problem or challenge?

DataFusion spills data to local disk when processing datasets that do not fit in available memory, as illustrated in this comment:

https://github.com/apache/datafusion/blob/918ac1b30be026e1276d638a0dbc10002627e9d7/datafusion/physical-plan/src/sorts/sort.rs#L88-L205

Here is the [code that handles spilling in sort and hash aggregates](https://github.com/apache/datafusion/blob/918ac1b30be026e1276d638a0dbc10002627e9d7/datafusion/physical-plan/src/spill.rs#L93-L110).

The current version of DataFusion spills data to disk using the Arrow IPC format, which is correct and was easy to get working, but comes with non-trivial overhead, as @andygrove found in Comet:
- https://github.com/apache/datafusion-comet/issues/1189

Some potential sources of overhead come from the validation applied to Arrow IPC files (which may in general have come from untrusted sources). That validation is unnecessary if the data was valid when written by DataFusion.

### Describe the solution you'd like

As part of improving DataFusion's performance on larger-than-memory datasets, I would like to consider adding a new optimized serialization format for use in spill files.
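
To make the idea concrete, here is a minimal sketch of what a "trusted" spill format could look like: the writer emits length-prefixed raw bytes and the reader takes them back without any validation, because the same process wrote the file. This is not the Arrow IPC format, not the Comet implementation, and not an existing DataFusion API. The names `write_spill` and `read_spill` are hypothetical, and a `Vec<i64>` stands in for a real `RecordBatch` column.

```rust
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Write};
use std::path::Path;

/// Hypothetical writer: each batch is stored as a u64 value count followed by
/// the raw little-endian i64 values. There is no schema, footer, or checksum.
fn write_spill(path: &Path, batches: &[Vec<i64>]) -> io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    for batch in batches {
        out.write_all(&(batch.len() as u64).to_le_bytes())?;
        for v in batch {
            out.write_all(&v.to_le_bytes())?;
        }
    }
    out.flush()
}

/// Hypothetical reader: trusts the file completely, since this process wrote it.
/// It performs no validation beyond what `read_exact` does.
fn read_spill(path: &Path) -> io::Result<Vec<Vec<i64>>> {
    let mut input = BufReader::new(File::open(path)?);
    let mut batches = Vec::new();
    let mut len_buf = [0u8; 8];
    loop {
        // EOF while reading a length prefix is treated as the end of the file.
        // (This sketch does not distinguish a truncated prefix from a clean EOF.)
        match input.read_exact(&mut len_buf) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        let len = u64::from_le_bytes(len_buf) as usize;

        // Read the whole batch payload in one call, then decode 8 bytes at a time.
        let mut raw = vec![0u8; len * 8];
        input.read_exact(&mut raw)?;
        let batch = raw
            .chunks_exact(8)
            .map(|chunk| {
                let mut bytes = [0u8; 8];
                bytes.copy_from_slice(chunk);
                i64::from_le_bytes(bytes)
            })
            .collect();
        batches.push(batch);
    }
    Ok(batches)
}
```

Calling `write_spill(&path, &batches)` followed by `read_spill(&path)` should return the same batches. A real format would also need to cover nulls, variable-length and nested types, and multiple columns.
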
This is similar to how we have added optimized (non-Arrow) in-memory storage formats for intermediate results in hash aggregation and others.

At a high level this would look like:
- [ ] Add a benchmark for spilling files (maybe write/read data to/from a spill file of various sizes)
- [ ] Add an optimized Reader/Writer
- [ ] Update the SortExec and GroupByHashExec to use this new Reader/Writer

### Describe alternatives you've considered

# Add a customized Reader/Writer

@andygrove has a PR to add a customized BatchReader to Comet that seems to offer significant performance improvements for shuffle reading/writing:
- https://github.com/apache/datafusion-comet/pull/1190

We could potentially upstream this code into DataFusion.

# Optimize the Arrow IPC reader

Another option would be to continue using the Arrow IPC format, but disable validation:
- https://github.com/apache/arrow-rs/issues/3287
- https://github.com/apache/arrow-rs/issues/6933

@totoroyyb has a recent PR to arrow-rs proposing this approach:
- https://github.com/apache/arrow-rs/pull/6938

### Additional context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org