jaylmiller commented on issue #5230: URL: https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1454899537
@ozankabak Not quite, the batches are not coalesced until the very end of the output. This is the process for a single partition, with `M` batches (i.e. each batch is sized `N/M`): 1. For each batch that streams in, sort that batch individually and place it into a buffer. If row encoding was used, the row encoding is also buffered (along side the RecordBatch). 2. Once the input stream has completed, merge all the buffered batches (the row encodings from step 1 can be used here if they exist). So we're actually performing `M` row conversions (and `M` sorts), and then doing a final merge/sort which does not perform any row encoding (reused). The docs for the sort implementation on main also describes the algorithm: https://github.com/apache/arrow-datafusion/blob/c37ddf72ec539bd39cce0dd4ff38db2e36ddb55f/datafusion/core/src/physical_plan/sorts/sort.rs#L64-L73 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org