alamb opened a new issue, #18779:
URL: https://github.com/apache/datafusion/issues/18779

   DataFusion currently has a standalone `CoalesceBatchesExec` operator that 
ensures batches are large enough to take advantage of vectorized execution.
   
   This operator is inserted after "filter-like" operations that can produce 
small batches, such as `FilterExec`, `HashJoinExec`, and `RepartitionExec`.
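
   To make the idea concrete, here is a minimal, std-only sketch of what "coalescing" means. It is a toy illustration, not the `CoalesceBatchesExec` implementation or the arrow kernel: `Batch`, `ToyCoalescer`, and `coalesce_all` are hypothetical names, and a batch is modeled as a single `Vec<i64>` column.

```rust
/// Toy stand-in for a columnar batch: a single column of i64 values.
/// (Assumption for illustration; real batches are Arrow `RecordBatch`es.)
type Batch = Vec<i64>;

/// Buffers small input batches and emits batches of exactly `target_rows`
/// rows, except possibly the last one, which `finish` may emit short.
struct ToyCoalescer {
    target_rows: usize,
    buffered: Vec<i64>,
    completed: Vec<Batch>,
}

impl ToyCoalescer {
    fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0, "target_rows must be positive");
        Self {
            target_rows,
            buffered: Vec::new(),
            completed: Vec::new(),
        }
    }

    /// Append one (possibly tiny) batch; move every full batch to `completed`.
    fn push(&mut self, batch: Batch) {
        self.buffered.extend(batch);
        while self.buffered.len() >= self.target_rows {
            // `split_off` leaves the first `target_rows` rows in `buffered`
            // and returns the remainder.
            let rest = self.buffered.split_off(self.target_rows);
            let full = std::mem::replace(&mut self.buffered, rest);
            self.completed.push(full);
        }
    }

    /// Flush any remaining rows as a final, possibly short, batch.
    fn finish(mut self) -> Vec<Batch> {
        if !self.buffered.is_empty() {
            let rest = std::mem::take(&mut self.buffered);
            self.completed.push(rest);
        }
        self.completed
    }
}

/// Convenience helper: coalesce a whole sequence of batches in one call.
fn coalesce_all(input: Vec<Batch>, target_rows: usize) -> Vec<Batch> {
    let mut coalescer = ToyCoalescer::new(target_rows);
    for batch in input {
        coalescer.push(batch);
    }
    coalescer.finish()
}
```

   For example, feeding batches of 2, 1, 4, and 1 rows through `coalesce_all` with a target of 3 rows yields batches of 3, 3, and 2 rows. The row order is preserved; only the batch boundaries change.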
   
   `CoalesceBatchesExec` is not ideal, as it can prevent other optimizations 
from happening (or make them more complicated than they would otherwise need to 
be, since they must know how to look through the operator). For example, see 
[here](https://github.com/apache/datafusion/blob/9029ff1834283a65988ed94eaa98a9387452d1bc/datafusion/physical-optimizer/src/optimizer.rs#L120-L119).
   
   A longstanding effort has been to make filter+concat faster, and 
we have introduced the 
[coalesce](https://docs.rs/arrow/latest/arrow/compute/kernels/coalesce/index.html)
 kernels upstream:
   - https://github.com/apache/datafusion/issues/7957
   
   
   Related issues
   - [ ] https://github.com/apache/datafusion/issues/18606
   - [ ] https://github.com/apache/datafusion/issues/18646
   - [ ] https://github.com/apache/datafusion/issues/7001
   - [ ] https://github.com/apache/datafusion/issues/15478
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

