Re: [PR] Fix: eliminate unnecessary repartitioning for small datasets [datafusion]

via GitHub Tue, 18 Nov 2025 13:12:45 -0800


ShashidharM0118 commented on PR #18776:
URL: https://github.com/apache/datafusion/pull/18776#issuecomment-3549527808


   @martin-g, Thanks for the review!
   
   I made these changes:
   - Switched to `stats.num_rows.get_value()` instead of matching only on `Precision::Exact(num_rows)`
   - Added a check for `round_robin_repartition()` so we still repartition when users want extra parallelism
   - Added logic to fetch the statistics, defaulting to repartitioning when stats aren't available
   
   I set the threshold to `10 * batch_size`. IMO, if the dataset is only
around a single batch in size, distributing it across partitions creates
"micro-batches" and adds unnecessary overhead. I am not sure this is the
best value, so let me know your thoughts.
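
As a rough illustration of the decision described above, here is a minimal, self-contained sketch. It is not the code from the PR. The `Precision` enum is a local stand-in modeled on DataFusion's `Precision<usize>`, and `should_repartition`, `SMALL_INPUT_BATCHES`, and the `wants_round_robin` flag are hypothetical names chosen for this example. Treating a row count exactly at the threshold as "small" is also an assumption.

```rust
/// Local stand-in for DataFusion's `Precision<usize>` (assumption: this
/// mirrors only the parts needed here, it is not the real type).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Returns the row count whether it is exact or estimated.
    fn get_value(&self) -> Option<&usize> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }
}

/// Hypothetical constant: inputs up to this many batches count as "small".
const SMALL_INPUT_BATCHES: usize = 10;

/// Hypothetical helper: decide whether to repartition an input.
fn should_repartition(
    num_rows: Precision,
    batch_size: usize,
    wants_round_robin: bool,
) -> bool {
    // Respect an explicit request for extra parallelism.
    if wants_round_robin {
        return true;
    }
    match num_rows.get_value() {
        // Small inputs would only be split into micro-batches, so skip them.
        Some(&rows) => rows > SMALL_INPUT_BATCHES.saturating_mul(batch_size),
        // No statistics available: keep the existing behavior and repartition.
        None => true,
    }
}
```

With a `batch_size` of 8192, this sketch skips repartitioning for inputs of up to 81920 rows and repartitions anything larger, as well as any input without row-count statistics.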
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


