Re: [I] Optimized spill file format [datafusion]

via GitHub Fri, 10 Jan 2025 19:24:46 -0800


2010YOUY01 commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585043689


   Although we currently spill column-wise record batches, I expect this to 
change to row-wise batches in the future. It would be better to also benchmark 
and optimize spilling the Arrow Row format as part of this issue.
   
   The reason is that the spilling operation involves sorting, spilling sorted 
runs, and reading back those runs for merging. Both sorting and merging benefit 
from the row format. The current implementation performs several unnecessary 
conversions between row and column formats, which could become inefficient.
   The preferred approach would be:
   1. Convert to row format and sort
   2. Maintain the row format until the final output (I believe `Sort` will 
benefit more from this, because it does 2 phases of merging, while `Aggregate` 
only does 1 phase)
   This is tracked by https://github.com/apache/datafusion/issues/7053
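
   To make the two steps above concrete, here is a minimal sketch of the idea. It is not DataFusion code and does not use the real Arrow Row encoding: `encode_row`, `spill_run`, `read_run`, and `sort_with_spill` are hypothetical helpers, the "row format" is a toy byte-comparable encoding (a sign-flipped big-endian 64-bit key followed by a UTF-8 payload), and Python is used only for brevity. The point it illustrates is that rows are encoded once, sorted and spilled as rows, merged as rows, and converted back to columns only for the final output.

```python
import heapq
import struct
import tempfile

SIGN_BIT = 1 << 63  # flipping it makes signed keys sort correctly as unsigned bytes


def encode_row(key: int, payload: str) -> bytes:
    """Toy byte-comparable row (an assumption, not the Arrow Row layout)."""
    return struct.pack(">Q", key + SIGN_BIT) + payload.encode("utf-8")


def decode_row(row: bytes) -> tuple[int, str]:
    """Inverse of encode_row: recover the (key, payload) column values."""
    (biased,) = struct.unpack(">Q", row[:8])
    return biased - SIGN_BIT, row[8:].decode("utf-8")


def spill_run(rows: list[bytes]):
    """Sort one run by raw bytes and write it, length-prefixed, to a temp file."""
    f = tempfile.TemporaryFile()
    for row in sorted(rows):
        f.write(struct.pack(">I", len(row)))
        f.write(row)
    f.seek(0)
    return f


def read_run(f):
    """Stream rows back from a spilled run without leaving the row format."""
    while True:
        header = f.read(4)
        if not header:
            return
        (length,) = struct.unpack(">I", header)
        yield f.read(length)


def sort_with_spill(keys: list[int], payloads: list[str], run_size: int):
    # Step 1: convert columns to rows once, then sort each run as rows.
    rows = [encode_row(k, p) for k, p in zip(keys, payloads)]
    runs = [spill_run(rows[i:i + run_size]) for i in range(0, len(rows), run_size)]
    # Step 2: merge the spilled runs by comparing row bytes directly.
    merged = [decode_row(r) for r in heapq.merge(*(read_run(f) for f in runs))]
    for f in runs:
        f.close()
    # Only the final output is converted back to columns.
    return [k for k, _ in merged], [p for _, p in merged]
```

   In this sketch the only column-to-row conversion happens before the first sort, and the only row-to-column conversion happens after the final merge. Spilled runs are never decoded in between.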


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


