alamb opened a new issue, #18779: URL: https://github.com/apache/datafusion/issues/18779
DataFusion currently has a standalone `CoalesceBatchesExec` operator that ensures batches are large enough to take advantage of vectorized execution. This operator is inserted after "filter-like" operations that can produce small batches, such as `FilterExec`, `HashJoinExec`, and `RepartitionExec`.

`CoalesceBatchesExec` is not ideal because it can prevent other optimizations from being applied, or make them more complicated than they would otherwise need to be, since they must know how to look through the operator. For example, see [here](https://github.com/apache/datafusion/blob/9029ff1834283a65988ed94eaa98a9387452d1bc/datafusion/physical-optimizer/src/optimizer.rs#L120-L119).

A longstanding effort has been to make filter+concat faster, and we have introduced the [coalesce](https://docs.rs/arrow/latest/arrow/compute/kernels/coalesce/index.html) kernels upstream:
- https://github.com/apache/datafusion/issues/7957

Related issues
- [ ] https://github.com/apache/datafusion/issues/18606
- [ ] https://github.com/apache/datafusion/issues/18646
- [ ] https://github.com/apache/datafusion/issues/7001
- [ ] https://github.com/apache/datafusion/issues/15478
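
To illustrate what coalescing does for a "filter-like" operation, here is a minimal, dependency-free sketch. It is a toy model only: `Batch`, `Coalescer`, `push_filtered`, `next_completed`, and `finish` are hypothetical names invented for this example, and it is neither the `CoalesceBatchesExec` implementation nor the API of the arrow coalesce kernels. It assumes a batch is a single column of `i32` values and shows rows that survive a filter being appended directly into a buffer that is emitted once it reaches a target size.

```rust
use std::collections::VecDeque;

/// Toy stand-in for a record batch: one column of i32 values (assumption).
type Batch = Vec<i32>;

/// Hypothetical helper that buffers filtered rows and emits batches of
/// exactly `target_size` rows, except possibly the last one.
struct Coalescer {
    target_size: usize,
    buffer: Batch,
    completed: VecDeque<Batch>,
}

impl Coalescer {
    fn new(target_size: usize) -> Self {
        assert!(target_size > 0, "target_size must be positive");
        Self {
            target_size,
            buffer: Vec::with_capacity(target_size),
            completed: VecDeque::new(),
        }
    }

    /// Filter `input` and append the surviving rows to the buffer in one
    /// pass, so no small intermediate batch is materialized.
    fn push_filtered(&mut self, input: &[i32], predicate: impl Fn(i32) -> bool) {
        for &value in input {
            if predicate(value) {
                self.buffer.push(value);
                if self.buffer.len() == self.target_size {
                    // Buffer is full: move it to the output queue.
                    let full = std::mem::replace(
                        &mut self.buffer,
                        Vec::with_capacity(self.target_size),
                    );
                    self.completed.push_back(full);
                }
            }
        }
    }

    /// Return the next full batch, if one is ready.
    fn next_completed(&mut self) -> Option<Batch> {
        self.completed.pop_front()
    }

    /// Signal end of input: flush any remaining rows as a final, smaller batch.
    fn finish(&mut self) {
        if !self.buffer.is_empty() {
            let rest = std::mem::take(&mut self.buffer);
            self.completed.push_back(rest);
        }
    }
}
```

In this sketch a caller would call `push_filtered` for each input batch, drain `next_completed` after each call, and call `finish` once the input is exhausted. Because the buffering lives inside the filtering step, no separate coalescing stage sits between operators for other code to look through.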
