Re: [I] Eliminate Repartitioning for Small Datasets [datafusion]

via GitHub Sun, 16 Nov 2025 02:03:57 -0800


ShashidharM0118 commented on issue #18595:
URL: https://github.com/apache/datafusion/issues/18595#issuecomment-3538482301


   Hey everyone! I've been following this discussion and agree with 
@LiaCastaneda that Option 2 (the physical planner approach) is the better 
choice: we can choose `AggregateMode::Single` upfront for small datasets, 
which avoids the backwards-compatibility issues of skipping repartitions 
after `FinalPartitioned` mode has already been set. As @NGA-TRAN mentioned, 
I'd like to start with Parquet files, since they already have row count 
statistics available. I'm thinking we could add a check in the physical 
planner that uses the row count to decide the aggregate mode. Would it be 
okay if I work on this?
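
   To make the idea concrete, here is a rough, standalone sketch of the kind 
of check I have in mind. It is not DataFusion code: the `AggregateMode` enum 
below is a local stand-in with only the two variants discussed here, and 
`choose_aggregate_mode` and `SMALL_INPUT_ROW_THRESHOLD` (including its value) 
are hypothetical names I made up for illustration. The real planner would need 
to read the row count from the plan's statistics and would probably only trust 
it when it is known.

```rust
// Local stand-in for the aggregate modes discussed above. This is NOT the
// real DataFusion enum; only the two variants named in this thread appear.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AggregateMode {
    /// One aggregation step over the whole input, no repartitioning.
    Single,
    /// Final step of a two-phase aggregation over hash-repartitioned input.
    FinalPartitioned,
}

// Hypothetical cutoff, chosen arbitrarily for this sketch. A real
// implementation would presumably expose this as a configuration option.
const SMALL_INPUT_ROW_THRESHOLD: usize = 10_000;

/// Picks the aggregate mode from an optional row-count statistic.
///
/// `num_rows` is `None` when the source provides no row count; in that case
/// the sketch falls back to the repartitioned plan rather than guessing.
fn choose_aggregate_mode(num_rows: Option<usize>, threshold: usize) -> AggregateMode {
    match num_rows {
        // Small, known input: aggregate in a single step.
        Some(rows) if rows <= threshold => AggregateMode::Single,
        // Large or unknown input: keep the partitioned plan.
        _ => AggregateMode::FinalPartitioned,
    }
}
```

   With these assumptions, `choose_aggregate_mode(Some(500), 
SMALL_INPUT_ROW_THRESHOLD)` picks `Single`, while an unknown row count 
(`None`) keeps the `FinalPartitioned` plan.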


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

