jizezhang commented on issue #18782:
URL: https://github.com/apache/datafusion/issues/18782#issuecomment-3559220080

   Hi @alamb, I wanted to check my understanding and discuss the approach a 
little bit. Please let me know your thoughts and correct me if anything sounds off.
   
   I thought about two places where we could potentially coalesce batches:
   - Coalesce hash-partitioned input batches for the same output partition in 
an input stream before sending them to the output channel, e.g. coalesce in 
`RepartitionExec::pull_from_input` before `channel.sender.send`.
     - For the order-preserving case, this would result in larger input batches 
for the merge-sort. Since the merge-sort needs to wait for batches from all input 
partition streams to be ready, I wonder whether this would impact performance.
   - Coalesce batches in the output stream when `poll_next` is called, e.g.
     - For the non-order-preserving case, the output stream is a 
`PerPartitionStream`; for the order-preserving case, the output stream is a 
`SortPreservingMergeStream` whose input is a vector of `PerPartitionStream`s.
     - If we coalesce in, e.g., `PerPartitionStream::poll_next_inner`, it also 
means larger input batches for the merge-sort.
     - Alternatively, we could create a wrapper stream around 
`PerPartitionStream` that coalesces while calling `PerPartitionStream::poll_next`. 
This would look very much like a `CoalesceBatches`, but not as a separate node.
     - For the order-preserving case, maybe coalesce inside 
`SortPreservingMergeStream::poll_next_inner`.
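
To make the wrapper-stream idea concrete, here is a rough, synchronous sketch of what I have in mind. It is not DataFusion code: `Batch`, `CoalescingIter`, and `coalesce_all` are hypothetical names, a plain `Iterator` stands in for `PerPartitionStream`, and a `Vec<i64>` stands in for a `RecordBatch`. It also emits whatever is buffered once the target is reached, so an output batch can be larger than `target_rows`; a real implementation would probably want to slice to an exact size.

```rust
/// Hypothetical stand-in for an Arrow `RecordBatch`: a single column of rows.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    rows: Vec<i64>,
}

/// Hypothetical wrapper that coalesces small batches from an inner source
/// (standing in for `PerPartitionStream`) into batches of at least
/// `target_rows` rows, without being a separate plan node.
struct CoalescingIter<I: Iterator<Item = Batch>> {
    inner: I,
    target_rows: usize,
    buffer: Vec<i64>,
}

impl<I: Iterator<Item = Batch>> CoalescingIter<I> {
    fn new(inner: I, target_rows: usize) -> Self {
        Self {
            inner,
            target_rows,
            buffer: Vec::new(),
        }
    }
}

impl<I: Iterator<Item = Batch>> Iterator for CoalescingIter<I> {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        // Keep pulling from the inner source until enough rows are buffered
        // or the input is exhausted.
        while self.buffer.len() < self.target_rows {
            match self.inner.next() {
                Some(batch) => self.buffer.extend(batch.rows),
                None => break,
            }
        }
        if self.buffer.is_empty() {
            return None;
        }
        // Emit everything buffered so far; row order is preserved, which is
        // what the order-preserving case would rely on.
        Some(Batch {
            rows: std::mem::take(&mut self.buffer),
        })
    }
}

/// Convenience helper (hypothetical) that drains a list of batches through
/// the coalescing wrapper.
fn coalesce_all(batches: Vec<Batch>, target_rows: usize) -> Vec<Batch> {
    CoalescingIter::new(batches.into_iter(), target_rows).collect()
}
```

The same buffering logic could sit before `channel.sender.send` instead (the first option), with one buffer per output partition; the trade-off for the merge-sort input size would be the same either way.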
   
   Thanks a lot!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

