[PR] Coalesce hash repartition batches using BatchCoalescer [datafusion]

via GitHub Fri, 06 Mar 2026 02:17:52 -0800


Dandandan opened a new pull request, #20741:
URL: https://github.com/apache/datafusion/pull/20741


   ## Summary
   
   In hash repartitioning, input batches are split into sub-batches per output 
partition. With many partitions, these sub-batches can be very small (e.g. 8192 
rows / 16 partitions = ~512 rows per partition). Previously each small 
sub-batch was materialized via `take_arrays` and sent immediately through the 
channel.
   
   This change:
   - Adds `hash_partition_indices` to `BatchPartitioner` that computes hash 
indices without materializing sub-batches
   - Uses Arrow's `BatchCoalescer` per output partition to accumulate rows via 
`push_batch_with_indices` until `target_batch_size` is reached
   - Extracts `send_batch` helper for the channel send logic
   - Flushes remaining buffered batches when input is exhausted
   
   **Benefits:**
   - Reduces channel traffic — fewer, larger batches instead of many small ones
   - Avoids intermediate `take_arrays` materialization (deferred to 
`BatchCoalescer`)
   - Output batches are properly sized to `target_batch_size`
   
   ## Test plan
   - All 41 repartition unit tests pass
   - All 6 repartition SQL logic tests pass
   - Coalesce tests pass
   - 0-column schema edge case handled (falls back to direct send)
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Coalesce hash repartition batches using BatchCoalescer [datafusion]

Reply via email to