ShashidharM0118 commented on PR #18776: URL: https://github.com/apache/datafusion/pull/18776#issuecomment-3549527808
@martin-g, thanks for the review! I made these changes:

- Switched to `stats.num_rows.get_value()` instead of `Precision::Exact(num_rows)`
- Added a check for `round_robin_repartition()` to respect cases where users want extra parallelism
- Added logic to get statistics, defaulting to repartitioning when stats aren't available

I set the threshold to `10 * batch_size`. IMO, if the dataset size is only "in and around" a single batch size, distributing it creates "micro-batches" and causes unnecessary overhead. I am not entirely sure this is the best value, so let me know your thoughts.
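For reference, the row-count check described above could be sketched roughly as follows. This is an illustration only, not the code from the PR: `Precision` here is a local stand-in modeled on DataFusion's type of the same name, `should_repartition` and `SMALL_INPUT_BATCHES` are hypothetical names, and treating a row count exactly at the threshold as "large enough" is an assumption. The `round_robin_repartition()` check is left out because the comment does not say how it combines with the threshold.

```rust
/// Local stand-in modeled on DataFusion's `Precision<T>`, reduced to what
/// this sketch needs (assumption: not the real type).
#[derive(Debug, Clone, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T> Precision<T> {
    /// Returns the value for both exact and inexact statistics,
    /// and `None` when no statistic is available.
    pub fn get_value(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }
}

/// Hypothetical constant: the multiplier from the `10 * batch_size` threshold.
const SMALL_INPUT_BATCHES: usize = 10;

/// Hypothetical helper: decide whether an input is worth repartitioning.
///
/// - Row count unknown: repartition (the default described in the comment).
/// - Row count below `10 * batch_size`: skip, to avoid "micro-batches".
/// - Otherwise: repartition. (Using `>=` at the boundary is an assumption.)
pub fn should_repartition(num_rows: &Precision<usize>, batch_size: usize) -> bool {
    let threshold = SMALL_INPUT_BATCHES.saturating_mul(batch_size);
    match num_rows.get_value() {
        Some(&rows) => rows >= threshold,
        None => true,
    }
}
```

Using `get_value()` rather than matching only on `Precision::Exact` means an inexact estimate is also enough to skip repartitioning a small input, while absent statistics still fall back to repartitioning.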
