Dandandan opened a new pull request, #21677: URL: https://github.com/apache/datafusion/pull/21677
## Which issue does this PR close?

<!-- Related to improving CPU utilization in parallel query execution (e.g. ClickBench). -->

## Rationale for this change

In `RepartitionExec::pull_from_input`, each input task partitions an input batch and sends the resulting sub-batches to output channels. On every sub-batch, the input task:

1. Acquires the output partition's memory reservation `Mutex` and calls `try_grow`
2. Sends the batch through the distributor channel (which internally takes `channel.state.lock()` and potentially the gate mutex)

When hash-partitioning an input batch of size `S` into `N` output partitions, each output gets a sub-batch of size `~S/N`. With many output partitions (e.g. 16 cores → 16 output partitions), these sub-batches are very small, so the per-batch synchronization cost dominates.

## What changes are included in this PR?

Each input task now accumulates partitioned sub-batches in per-output-partition local buffers (`Vec<PartitionBuffer>`) and only flushes when a buffer reaches `FLUSH_THRESHOLD_BYTES` (256 KiB). Each flush performs a single `try_grow` call for the total buffered size instead of once per sub-batch. If `try_grow` fails, all buffered batches for that partition are spilled together. End-of-input triggers a final flush of any remaining buffered data per partition.

## Are these changes tested?

Covered by the existing repartition test suite (41 tests pass), including the spilling, delayed-stream, dropped-output-stream, and ordering-preservation tests.

## Are there any user-facing changes?

No. The change only affects batching granularity inside `RepartitionExec`; memory semantics, spilling behavior, and output ordering are preserved.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
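
To illustrate the buffering scheme described in the pull request above, here is a minimal Rust sketch using only the standard library. It is not the code from the PR: `PartitionBuffer`, `FLUSH_THRESHOLD_BYTES`, and `try_grow` are names taken from the description, while `SubBatch`, `Reservation`, `Stats`, and `run_input_task` are hypothetical stand-ins invented for this example. The sketch assumes a simple reservation that never shrinks and only counts batches as "sent" or "spilled" instead of moving real data.

```rust
use std::sync::Mutex;

/// Flush threshold named in the PR description (256 KiB).
const FLUSH_THRESHOLD_BYTES: usize = 256 * 1024;

/// Hypothetical stand-in for a record batch; only its size matters here.
struct SubBatch {
    num_bytes: usize,
}

/// Hypothetical stand-in for an output partition's shared memory reservation.
struct Reservation {
    state: Mutex<(usize, usize)>, // (bytes reserved, number of try_grow calls)
    limit: usize,
}

impl Reservation {
    fn new(limit: usize) -> Self {
        Self { state: Mutex::new((0, 0)), limit }
    }

    /// Reserve `bytes` if they fit under the limit; locks the mutex once per call.
    fn try_grow(&self, bytes: usize) -> bool {
        let mut state = self.state.lock().unwrap();
        state.1 += 1;
        if state.0.saturating_add(bytes) <= self.limit {
            state.0 += bytes;
            true
        } else {
            false
        }
    }

    fn calls(&self) -> usize {
        self.state.lock().unwrap().1
    }
}

/// Counters describing what one simulated input task did.
#[derive(Default, Debug, PartialEq)]
struct Stats {
    sent: usize,
    spilled: usize,
    try_grow_calls: usize,
}

/// Local buffer for one output partition, owned by a single input task.
#[derive(Default)]
struct PartitionBuffer {
    batches: Vec<SubBatch>,
    buffered_bytes: usize,
}

impl PartitionBuffer {
    /// Buffer a sub-batch locally; flush only once the threshold is reached.
    fn push(&mut self, batch: SubBatch, reservation: &Reservation, stats: &mut Stats) {
        self.buffered_bytes += batch.num_bytes;
        self.batches.push(batch);
        if self.buffered_bytes >= FLUSH_THRESHOLD_BYTES {
            self.flush(reservation, stats);
        }
    }

    /// One `try_grow` for the whole buffer; on failure, spill everything together.
    fn flush(&mut self, reservation: &Reservation, stats: &mut Stats) {
        if self.batches.is_empty() {
            return;
        }
        let count = self.batches.drain(..).count();
        if reservation.try_grow(self.buffered_bytes) {
            stats.sent += count;
        } else {
            stats.spilled += count;
        }
        self.buffered_bytes = 0;
    }
}

/// Simulate one input task that splits every input batch into one equally
/// sized sub-batch per output partition.
fn run_input_task(input_batches: usize, sub_batch_bytes: usize, partitions: usize, limit: usize) -> Stats {
    let reservations: Vec<Reservation> = (0..partitions).map(|_| Reservation::new(limit)).collect();
    let mut buffers: Vec<PartitionBuffer> = (0..partitions).map(|_| PartitionBuffer::default()).collect();
    let mut stats = Stats::default();
    for _ in 0..input_batches {
        for (buffer, reservation) in buffers.iter_mut().zip(&reservations) {
            buffer.push(SubBatch { num_bytes: sub_batch_bytes }, reservation, &mut stats);
        }
    }
    // End of input: flush whatever is still buffered for each partition.
    for (buffer, reservation) in buffers.iter_mut().zip(&reservations) {
        buffer.flush(reservation, &mut stats);
    }
    stats.try_grow_calls = reservations.iter().map(Reservation::calls).sum();
    stats
}
```

In this toy model, calling `run_input_task(64, 16 * 1024, 16, usize::MAX)` routes 1024 sub-batches of 16 KiB each but performs only 64 `try_grow` calls (four flushes per partition), where reserving per sub-batch would take 1024. Setting the limit to `0` makes every flush fail, so all 1024 sub-batches are counted as spilled in groups.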
