[GitHub] [arrow-datafusion] jaylmiller commented on issue #5230: Use Arrow Row Format in SortExec

via GitHub Sat, 04 Mar 2023 13:56:38 -0800


jaylmiller commented on issue #5230:
URL: 
https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1454899537


   @ozankabak Not quite, the batches are not coalesced until the very end of 
the output. 
   
   This is the process for a single partition, with `M` batches (i.e. each 
batch is sized `N/M`):
   1. For each batch that streams in, sort that batch individually and place it 
into a buffer. If row encoding was used, the row encoding is also buffered 
(along side the RecordBatch).
   2. Once the input stream has completed, merge all the buffered batches (the 
row encodings from step 1 can be used here if they exist).
   
   So we're actually performing `M` row conversions (and `M` sorts), and then 
doing a final merge/sort which does not perform any row encoding (reused).
   
   The docs for the sort implementation on main also describes the algorithm:
   
https://github.com/apache/arrow-datafusion/blob/c37ddf72ec539bd39cce0dd4ff38db2e36ddb55f/datafusion/core/src/physical_plan/sorts/sort.rs#L64-L73


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-datafusion] jaylmiller commented on issue #5230: Use Arrow Row Format in SortExec

Reply via email to