Re: [PR] Use NDV estimate to pre-allocate hash tables during aggregation [datafusion]

via GitHub Wed, 15 Apr 2026 13:00:05 -0700


mbutrovich commented on PR #21654:
URL: https://github.com/apache/datafusion/pull/21654#issuecomment-4255062312


   > As there also is a risk of estimating NDV too high, I added a cap for 128K 
rows (I think it should be configurable). We should remove the upper limit if 
the NDV stats are exact as well.
   
   So for Comet we have the final aggregation after a shuffle stage, and Spark 
will tell us the number of rows in the shuffle stage. That would act as an 
upper bound for NDV for Comet, while the max NDV in any single partition would 
act as our lower bound. Comet would have to encode these statistics values at 
plan generation from that stage.
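
A minimal sketch of the bounding idea above, in Rust. The function names, 
signatures, and the fallback to the lower bound when no estimate exists are 
assumptions made for illustration; none of this is DataFusion or Comet API. 
The cap is passed as an `Option` so it can be dropped when the NDV stats are 
exact.

```rust
/// Hypothetical helper (not part of DataFusion or Comet): clamp an NDV
/// estimate between the two bounds described in the comment.
///
/// * lower bound: the largest NDV observed in any single input partition
/// * upper bound: the number of rows reported for the shuffle stage
fn bounded_ndv(
    estimate: Option<usize>,
    per_partition_ndv: &[usize],
    shuffle_rows: usize,
) -> usize {
    // The final aggregation sees at least as many distinct groups as the
    // busiest single partition did.
    let lower = per_partition_ndv.iter().copied().max().unwrap_or(0);
    // It can never see more groups than rows. The `max` guards against
    // inconsistent statistics so that `clamp` cannot panic.
    let upper = shuffle_rows.max(lower);
    // Assumption: with no estimate at all, fall back to the conservative
    // lower bound rather than over-allocating.
    estimate.unwrap_or(lower).clamp(lower, upper)
}

/// Hypothetical helper: apply the optional pre-allocation cap.
/// `None` models the "stats are exact, no upper limit" case.
fn initial_capacity(ndv: usize, cap: Option<usize>) -> usize {
    match cap {
        Some(c) => ndv.min(c),
        None => ndv,
    }
}
```

For example, with per-partition NDVs of 100, 400, and 250 and 10,000 shuffle 
rows, an estimate of 1,000,000 is pulled down to 10,000 and an estimate of 50 
is raised to 400. Interpreting the 128K cap as 131,072 entries is also an 
assumption.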


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

