Dandandan opened a new pull request, #21677:
URL: https://github.com/apache/datafusion/pull/21677

   ## Which issue does this PR close?
   
   <!-- Related to improving CPU utilization in parallel query execution (e.g. 
ClickBench). -->
   
   ## Rationale for this change
   
   In `RepartitionExec::pull_from_input`, each input task partitions an input 
batch and sends the resulting sub-batches to output channels. On every 
sub-batch, the input task:
   1. Acquires the output partition's memory reservation `Mutex` and calls 
`try_grow`
   2. Sends the batch through the distributor channel (which internally takes 
`channel.state.lock()` and potentially the gate mutex)
   
   When hash-partitioning an input batch of size `S` into `N` output 
partitions, each output gets a sub-batch of size `~S/N`. With many output 
partitions (e.g. 16 cores → 16 output partitions), these sub-batches are very 
small, so the per-batch synchronization cost dominates.
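
   To make the scale concrete, here is a back-of-the-envelope sketch. It assumes rows hash evenly across outputs (real data may be skewed) and uses DataFusion's default `batch_size` of 8192 rows as the example input. The function names are hypothetical and are not part of `RepartitionExec`.

```rust
/// Rows each output partition receives from one input batch, assuming the
/// hash function spreads rows evenly (an idealization; real data can skew).
fn rows_per_sub_batch(batch_rows: usize, num_partitions: usize) -> usize {
    batch_rows / num_partitions
}

/// Number of "lock reservation + try_grow + channel send" rounds a single
/// input task performs when every sub-batch is sent individually.
fn sync_rounds_unbuffered(num_input_batches: usize, num_partitions: usize) -> usize {
    num_input_batches * num_partitions
}
```

   Under these assumptions, an 8192-row batch split 16 ways yields 512-row sub-batches, and 1000 input batches cost one input task 16000 synchronization rounds.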
   
   ## What changes are included in this PR?
   
   Each input task now accumulates partitioned sub-batches in 
per-output-partition local buffers (`Vec<PartitionBuffer>`) and flushes a 
buffer only when it reaches `FLUSH_THRESHOLD_BYTES` (256 KiB). Each flush makes a 
single `try_grow` call for the total buffered size instead of one call per 
sub-batch. If `try_grow` fails, all buffered batches for that partition are 
spilled together.
   
   End-of-input triggers a final flush of any remaining buffered data per 
partition.
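
   The buffering scheme described above can be sketched with the standard library alone. This is a simplified illustration, not the PR's code: `Batch`, `Reservation`, `Flushed`, and the fields of `PartitionBuffer` are hypothetical stand-ins, and only the `FLUSH_THRESHOLD_BYTES` value and the "one `try_grow` per flush, spill everything on failure" behavior come from the description.

```rust
use std::sync::Mutex;

/// Flush threshold named in the PR description (256 KiB).
const FLUSH_THRESHOLD_BYTES: usize = 256 * 1024;

/// Stand-in for a record batch; only its size in bytes matters here.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    bytes: usize,
}

/// Toy memory reservation with a fixed limit (hypothetical, for illustration).
struct Reservation {
    used: usize,
    limit: usize,
}

impl Reservation {
    /// Grow the reservation, or fail if the limit would be exceeded.
    fn try_grow(&mut self, additional: usize) -> Result<(), String> {
        if self.used + additional > self.limit {
            Err(format!("cannot reserve {additional} bytes"))
        } else {
            self.used += additional;
            Ok(())
        }
    }
}

/// What a flush did with the buffered batches.
#[derive(Debug, PartialEq)]
enum Flushed {
    InMemory(Vec<Batch>),
    Spilled(Vec<Batch>),
}

/// Per-output-partition local buffer (fields are assumptions).
#[derive(Default)]
struct PartitionBuffer {
    batches: Vec<Batch>,
    bytes: usize,
}

impl PartitionBuffer {
    /// Buffer a sub-batch; returns true once the flush threshold is reached.
    fn push(&mut self, batch: Batch) -> bool {
        self.bytes += batch.bytes;
        self.batches.push(batch);
        self.bytes >= FLUSH_THRESHOLD_BYTES
    }

    /// Drain the buffer, taking the reservation lock once for the total size.
    /// If the reservation cannot grow, all drained batches are marked spilled.
    fn flush(&mut self, reservation: &Mutex<Reservation>) -> Option<Flushed> {
        if self.batches.is_empty() {
            return None;
        }
        let total = std::mem::take(&mut self.bytes);
        let batches = std::mem::take(&mut self.batches);
        // The lock guard is dropped at the end of this statement.
        let grown = reservation.lock().unwrap().try_grow(total);
        Some(match grown {
            Ok(()) => Flushed::InMemory(batches),
            Err(_) => Flushed::Spilled(batches),
        })
    }
}
```

   An input task would call `push` for each sub-batch, call `flush` whenever `push` returns `true`, and call `flush` once more for every partition at end of input.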
   
   ## Are these changes tested?
   
   Covered by the existing repartition test suite (41 tests pass), including 
the spilling, delayed-stream, dropped-output-stream, and ordering-preservation 
tests.
   
   ## Are there any user-facing changes?
   
   No. The change only affects batching granularity inside `RepartitionExec`; 
memory semantics, spilling behavior, and output ordering are preserved.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)



