andygrove opened a new pull request, #3234: URL: https://github.com/apache/datafusion-comet/pull/3234
## Summary

This PR adds batch coalescing before shuffle writes to reduce per-batch overhead and improve vectorization efficiency. When enabled, small columnar batches are combined until they reach the target batch size before being processed by the shuffle writer.

**Key changes:**

- Added `spark.comet.shuffle.resizeBatches.input` config to enable coalescing batches before shuffle write
- Added `spark.comet.shuffle.resizeBatches.output` config for coalescing after shuffle read
- Native planner wraps shuffle input with DataFusion's `CoalesceBatchesExec` when input coalescing is enabled
- Added `CometBatchCoalescer` Scala class for output-side batch coalescing

**Performance benefits observed in TPC-H Q18 benchmarks:**

- 10.9% overall query time improvement
- Significantly reduced GC pressure (e.g., Stage 26: GC time dropped from 3,602 ms to 56 ms)
- Better vectorization efficiency for downstream operators

## Test plan

- [ ] Verify existing unit tests pass
- [ ] Run TPC-H Q18 benchmark with `spark.comet.shuffle.resizeBatches.input=true`
- [ ] Verify GC metrics improve with the optimization enabled
- [ ] Test with various batch sizes to ensure correct behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)
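
For readers unfamiliar with the idea, the coalescing behavior described in the summary (buffer small batches, emit a combined batch once the target size is reached) can be sketched roughly as follows. This is a std-only illustration, not the code in this PR and not DataFusion's `CoalesceBatchesExec`: `Batch`, `BatchCoalescer`, `coalesce_all`, and `target_rows` are hypothetical names, a batch is modeled as a single `i64` column, and the sketch makes no attempt to split batches that overshoot the target.

```rust
/// Hypothetical stand-in for a columnar batch: one column of i64 values.
/// (A real implementation would operate on Arrow record batches.)
type Batch = Vec<i64>;

/// Buffers incoming batches and emits one combined batch once at least
/// `target_rows` rows have accumulated. Illustrative only.
struct BatchCoalescer {
    target_rows: usize,
    buffered: Vec<Batch>,
    buffered_rows: usize,
}

impl BatchCoalescer {
    fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0, "target_rows must be positive");
        Self {
            target_rows,
            buffered: Vec::new(),
            buffered_rows: 0,
        }
    }

    /// Adds a batch; returns a combined batch when the target is reached.
    fn push(&mut self, batch: Batch) -> Option<Batch> {
        if batch.is_empty() {
            return None; // nothing to buffer
        }
        self.buffered_rows += batch.len();
        self.buffered.push(batch);
        if self.buffered_rows >= self.target_rows {
            self.flush()
        } else {
            None
        }
    }

    /// Emits whatever is currently buffered (used at end of input).
    fn flush(&mut self) -> Option<Batch> {
        if self.buffered.is_empty() {
            return None;
        }
        // Allocate once for the combined batch, then concatenate in order.
        let mut out = Vec::with_capacity(self.buffered_rows);
        for b in self.buffered.drain(..) {
            out.extend(b);
        }
        self.buffered_rows = 0;
        Some(out)
    }
}

/// Runs a whole input stream through the coalescer.
fn coalesce_all(input: Vec<Batch>, target_rows: usize) -> Vec<Batch> {
    let mut coalescer = BatchCoalescer::new(target_rows);
    let mut out = Vec::new();
    for batch in input {
        if let Some(combined) = coalescer.push(batch) {
            out.push(combined);
        }
    }
    // Drain any remainder smaller than the target.
    if let Some(rest) = coalescer.flush() {
        out.push(rest);
    }
    out
}
```

In this sketch the downstream consumer (here, the shuffle writer) sees fewer, larger batches, which is the source of the reduced per-batch overhead the summary refers to; the trade-off is extra buffering and one copy per emitted batch.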
