gene-bordegaray commented on issue #18341: URL: https://github.com/apache/datafusion/issues/18341#issuecomment-3505283923
> [@gene-bordegaray](https://github.com/gene-bordegaray): Great analysis. I have read both the `full report`, which includes your study of how the physical rules work, and the `issue report`, which covers only the issue.
>
> **To reviewers:** If you are already familiar with DataFusion, you only need to read the Issue Report.
>
> ## This is the summary of the fix
>
> ### CSV files that lack statistics
>
> We always add a Round Robin Repartition, and the plan looks like this, which seems very reasonable to me (because there are no stats):
>
>     01)ProjectionExec: expr=[env@0 as env, count(Int64(1))@1 as count(*)]
>     02)--AggregateExec: mode=FinalPartitioned, gby=[env], aggr=[count()]
>     03)----CoalesceBatchesExec: target_batch_size=8192
>     04)------RepartitionExec: partitioning=Hash([env], 4), input_partitions=4
>     05)--------AggregateExec: mode=Partial, gby=[env], aggr=[count()]
>     06)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
>     07)------------DataSourceExec
>
> ### Parquet files with `Exact` statistics
>
> There are 2 cases:
>
> 1. When the file is small, we do not need to repartition it, and the plan looks like:
>
>        01)ProjectionExec: expr=[env, count()]
>        02)--AggregateExec: mode=FinalPartitioned, gby=[env], aggr=[count()]
>        03)----CoalesceBatchesExec: target_batch_size=8192
>        04)------RepartitionExec: partitioning=Hash([env], 4), input_partitions=1
>        05)--------AggregateExec: mode=Partial, gby=[env], aggr=[count]
>        06)----------DataSourceExec
>
> 2. When the file is large, we do add the repartition:
>
>        01)ProjectionExec: expr=[env@0 as env, count(Int64(1))@1 as count(*)]
>        02)--AggregateExec: mode=FinalPartitioned, gby=[env], aggr=[count()]
>        03)----CoalesceBatchesExec: target_batch_size=8192
>        04)------RepartitionExec: partitioning=Hash([env], 4), input_partitions=4
>        05)--------AggregateExec: mode=Partial, gby=[env], aggr=[count()]
>        06)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
>        07)------------DataSourceExec
>
> The fix is exactly what we want.
>
> [@gene-bordegaray](https://github.com/gene-bordegaray): Could you confirm if the summary above is accurate? If it is, this looks like the ideal fix.

Yes, it is. Great summary, thank you Nga.
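
For readers skimming the thread, the three cases in the quoted summary can be condensed into a small decision function. This is only an illustrative sketch, not DataFusion's actual optimizer code: `RowCount`, `SMALL_INPUT_ROWS`, and `add_round_robin` are hypothetical names, and the cutoff value is an assumption, since the thread does not state what counts as a "small" file.

```rust
/// Simplified stand-in for row-count statistics on a data source.
/// Hypothetical type for illustration, not DataFusion's own statistics API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RowCount {
    /// The source reports an exact row count (e.g. Parquet metadata).
    Exact(usize),
    /// The source reports no statistics (e.g. CSV).
    Absent,
}

/// Assumed cutoff for a "small" input; the real value is not given in the thread.
const SMALL_INPUT_ROWS: usize = 8192;

/// Returns true when a RoundRobinBatch repartition would be placed above the source.
fn add_round_robin(rows: RowCount) -> bool {
    match rows {
        // No statistics: always repartition, as in the CSV plan.
        RowCount::Absent => true,
        // Exact statistics: repartition only when the input is not small.
        RowCount::Exact(n) => n > SMALL_INPUT_ROWS,
    }
}
```

Under these assumptions, `add_round_robin(RowCount::Absent)` and `add_round_robin(RowCount::Exact(1_000_000))` return `true`, while `add_round_robin(RowCount::Exact(100))` returns `false`. The hash repartition between the partial and final aggregates appears in all three plans, so the sketch models only the round-robin step.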
