NGA-TRAN commented on issue #18341:
URL: https://github.com/apache/datafusion/issues/18341#issuecomment-3505130845

   @gene-bordegaray : Great analysis. I have read both the `full report`, which 
also covers your study of how the physical rules work, and the `issue report`, 
which covers only the issue.
   
   **To reviewers:** If you are already familiar with DataFusion, you only need 
to read the `issue report`.
   
   ## This is the summary of the fix
   
   ### CSV files that lack statistics
   We always add a round-robin repartition, and the plan looks like this, which 
seems very reasonable to me given that there are no stats:
   
   ```SQL
   01)ProjectionExec: expr=[env@0 as env, count(Int64(1))@1 as count(*)]
   02)--AggregateExec: mode=FinalPartitioned, gby=[env], aggr=[count()]
   03)----CoalesceBatchesExec: target_batch_size=8192
   04)------RepartitionExec: partitioning=Hash([env], 4), input_partitions=4
   05)--------AggregateExec: mode=Partial, gby=[env], aggr=[count()]
   06)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
   07)------------DataSourceExec
   ```
   
   ### Parquet files with `Exact` statistics
   
   There are 2 cases:
   
   1. When the file is small, we do not add a round-robin repartition (the hash 
repartition reads from a single input partition), and the plan looks like this:
   
   ```SQL
   01)ProjectionExec: expr=[env, count()]
   02)--AggregateExec: mode=FinalPartitioned, gby=[env], aggr=[count()]
   03)----CoalesceBatchesExec: target_batch_size=8192
   04)------RepartitionExec: partitioning=Hash([env], 4), input_partitions=1
   05)--------AggregateExec: mode=Partial, gby=[env], aggr=[count]
   06)----------DataSourceExec
   ```
   
   2. When the file is large, we do add a round-robin repartition:
   
   ```SQL
   01)ProjectionExec: expr=[env@0 as env, count(Int64(1))@1 as count(*)]
   02)--AggregateExec: mode=FinalPartitioned, gby=[env], aggr=[count()]
   03)----CoalesceBatchesExec: target_batch_size=8192
   04)------RepartitionExec: partitioning=Hash([env], 4), input_partitions=4
   05)--------AggregateExec: mode=Partial, gby=[env], aggr=[count()]
   06)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
   07)------------DataSourceExec
   ```
   
   The fix is exactly what we want.
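   
   To make the three cases concrete, here is a minimal sketch of the decision as I read it from the plans above. It is not DataFusion's actual code: `RowCount`, `wants_round_robin`, and the `threshold` parameter are hypothetical names, and treating "small" as an exact row count at or below some threshold is an assumption on my part.
   
   ```rust
   /// Hypothetical stand-in for a row-count statistic (not a DataFusion type).
   #[derive(Debug, Clone, Copy, PartialEq)]
   enum RowCount {
       /// The source reports an exact row count (e.g. Parquet with `Exact` stats).
       Exact(usize),
       /// The source reports no statistics (e.g. CSV).
       Absent,
   }
   
   /// Returns true when a RoundRobinBatch repartition would be added above the scan.
   /// `threshold` is an assumed cutoff between "small" and "large" inputs.
   fn wants_round_robin(rows: RowCount, threshold: usize) -> bool {
       match rows {
           // No stats: we cannot prove the input is small, so always repartition.
           RowCount::Absent => true,
           // Exact stats: repartition only when the input is larger than the cutoff.
           RowCount::Exact(n) => n > threshold,
       }
   }
   ```
   
   With a cutoff of 8192, `wants_round_robin(RowCount::Absent, 8192)` is true (the CSV case), `wants_round_robin(RowCount::Exact(100), 8192)` is false (the small Parquet case), and `wants_round_robin(RowCount::Exact(1_000_000), 8192)` is true (the large Parquet case).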
   
   
   @gene-bordegaray : Could you confirm if the summary above is accurate? If it 
is, this looks like the ideal fix.
   
   
   
   

