Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

via GitHub Mon, 14 Apr 2025 23:44:18 -0700


zhuqi-lucas commented on PR #15380:
URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2803881895


   It seems that when we merge the sorted batches, we already use `interleave` to merge rows by their sorted indices. Here is the code:
   
   ```rust
       /// Drains the in_progress row indexes, and builds a new RecordBatch from them
       ///
       /// Will then drop any batches for which all rows have been yielded to the output
       ///
       /// Returns `None` if no pending rows
       pub fn build_record_batch(&mut self) -> Result<Option<RecordBatch>> {
           if self.is_empty() {
               return Ok(None);
           }
   
           let columns = (0..self.schema.fields.len())
               .map(|column_idx| {
                   let arrays: Vec<_> = self
                       .batches
                       .iter()
                       .map(|(_, batch)| batch.column(column_idx).as_ref())
                       .collect();
                   Ok(interleave(&arrays, &self.indices)?)
               })
               .collect::<Result<Vec<_>>>()?;
   
           self.indices.clear();
           // ... (rest of the function omitted from this excerpt)
   ```
   
   
   
   But in this PR, we also concat some batches into one batch. Do you mean we could also use the indices from each batch to build that one batch, just like in the merge phase?
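
   To make the question concrete, here is a minimal std-only sketch of the two approaches. It models a batch as a plain slice and is not the arrow or DataFusion code. The names `interleave_rows`, `sort_via_indices`, and `sort_via_concat` are hypothetical. The only thing borrowed from the real kernel is the shape of the indices: `(batch_idx, row_idx)` pairs, as in the `interleave(&arrays, &self.indices)` call quoted above.

   ```rust
   /// Std-only model of an interleave kernel: for each `(batch_idx, row_idx)`
   /// pair, copy that row from the corresponding batch into the output.
   fn interleave_rows<T: Clone>(batches: &[&[T]], indices: &[(usize, usize)]) -> Vec<T> {
       indices
           .iter()
           .map(|&(batch_idx, row_idx)| batches[batch_idx][row_idx].clone())
           .collect()
   }

   /// Sort rows spread over several unsorted batches without concatenating first:
   /// sort the `(batch_idx, row_idx)` pairs by the value they point at, then
   /// gather the rows once.
   fn sort_via_indices<T: Ord + Clone>(batches: &[&[T]]) -> Vec<T> {
       let mut indices: Vec<(usize, usize)> = batches
           .iter()
           .enumerate()
           .flat_map(|(batch_idx, batch)| (0..batch.len()).map(move |row_idx| (batch_idx, row_idx)))
           .collect();
       // Stable sort: equal values keep batch order, then row order.
       indices.sort_by(|&(ba, ra), &(bb, rb)| batches[ba][ra].cmp(&batches[bb][rb]));
       interleave_rows(batches, &indices)
   }

   /// Baseline for comparison: concatenate all batches, then sort the single batch.
   fn sort_via_concat<T: Ord + Clone>(batches: &[&[T]]) -> Vec<T> {
       let mut all: Vec<T> = batches.iter().flat_map(|b| b.iter().cloned()).collect();
       all.sort();
       all
   }
   ```

   Both functions return the same sorted rows. The difference is only whether the rows are copied once (gathered through the indices) or twice (concatenated, then reordered). Which one is faster on real Arrow arrays would need to be benchmarked; this sketch makes no claim about that.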


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


