Dandandan opened a new pull request, #21551: URL: https://github.com/apache/datafusion/pull/21551
## Which issue does this PR close?

- N/A (performance improvement)

## Rationale for this change

In `RepartitionExec`'s round-robin mode, batches are currently sent to partitions in strict sequential order. If a downstream consumer is slow, data piles up in that channel's buffer while other channels may be empty and idle. This causes unnecessary buffering and suboptimal throughput.

## What changes are included in this PR?

- Added `is_empty()` method to `DistributionSender` to check if a channel's buffer is currently empty
- Modified `pull_from_input` in `RepartitionExec`: in round-robin mode, before sending to the next partition in sequence, check if that channel has buffered data. If so, scan for an empty channel and send there instead. Falls back to the original partition if no empty channel is found.

This makes round-robin repartitioning adaptive to varying consumer speeds while maintaining the same total data distribution.

## Are these changes tested?

Yes, existing round-robin repartition tests are updated to validate total row counts across all partitions (rather than exact per-partition counts, which are now non-deterministic due to the adaptive behavior).

## Are there any user-facing changes?

No API changes. Repartitioned data may be distributed differently across output partitions compared to before, but total row counts are preserved.
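For readers who want to see the selection rule in isolation, here is a minimal std-only sketch of the adaptive round-robin choice described under "What changes are included in this PR?". It is not the code from this PR: `Channel`, `pick_partition`, and `send_adaptive` are hypothetical names, the channel is modeled as a plain `VecDeque` rather than DataFusion's `DistributionSender`, and both the scan order (starting just after the preferred partition) and the cursor advancing by one per batch are assumptions.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an output channel; only its buffer is modeled.
pub struct Channel<T> {
    buffer: VecDeque<T>,
}

impl<T> Channel<T> {
    pub fn new() -> Self {
        Self { buffer: VecDeque::new() }
    }
    /// True when nothing is waiting to be consumed.
    pub fn is_empty(&self) -> bool {
        self.buffer.is_empty()
    }
    pub fn len(&self) -> usize {
        self.buffer.len()
    }
    pub fn send(&mut self, item: T) {
        self.buffer.push_back(item);
    }
    pub fn recv(&mut self) -> Option<T> {
        self.buffer.pop_front()
    }
}

/// Choose a target partition. `channels` must be non-empty and `next` in range.
/// Prefer `next` if its buffer is empty; otherwise scan the other channels
/// (assumed order: next+1, next+2, ... wrapping around) for an empty one,
/// and fall back to `next` if every channel already holds data.
pub fn pick_partition<T>(channels: &[Channel<T>], next: usize) -> usize {
    if channels[next].is_empty() {
        return next;
    }
    let n = channels.len();
    (1..n)
        .map(|offset| (next + offset) % n)
        .find(|&i| channels[i].is_empty())
        .unwrap_or(next)
}

/// Send one item and advance the round-robin cursor by one (an assumption;
/// the PR may advance it differently). Returns the partition that was used.
pub fn send_adaptive<T>(channels: &mut [Channel<T>], next: &mut usize, item: T) -> usize {
    let target = pick_partition(channels, *next);
    channels[target].send(item);
    *next = (*next + 1) % channels.len();
    target
}
```

With three channels, the first three sends go to partitions 0, 1, 2 as in plain round-robin. Once every buffer holds data the sketch falls back to the sequential choice, and as soon as one consumer drains its buffer the next batch is diverted there. Every item is sent exactly once, so total counts are preserved even though the per-partition split depends on consumer timing.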
