mbutrovich commented on PR #21654: URL: https://github.com/apache/datafusion/pull/21654#issuecomment-4255062312
> As there also is a risk of estimating NDV too high, I added a cap for 128K rows (I think it should be configurable). We should remove the upper limit if the NDV stats are exact as well.

For Comet, the final aggregation runs after a shuffle stage, and Spark tells us the number of rows in that shuffle stage. That row count would act as an upper bound on NDV for Comet, while the maximum NDV in any single partition would act as the lower bound. Comet would have to code these statistics values in at plan generation, using the numbers from that stage.
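
To make the bounds concrete, here is a minimal sketch of how they could be derived and applied. It is only an illustration under stated assumptions: `NdvBounds`, `ndv_bounds`, and `clamp_estimate` are hypothetical names, not DataFusion or Comet APIs, and the inputs (the shuffle row count and the per-partition NDV values) are assumed to be available at plan generation.

```rust
/// Hypothetical container for NDV (number of distinct values) bounds.
/// Not a DataFusion or Comet type; introduced for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NdvBounds {
    pub lower: usize,
    pub upper: usize,
}

/// Derive NDV bounds for a final aggregation that runs after a shuffle.
///
/// `shuffle_rows` is the row count reported for the shuffle stage; there
/// cannot be more distinct values than rows, so it is the upper bound.
/// `partition_ndvs` holds the NDV of each partition; the global NDV is at
/// least the largest of them, so the maximum is the lower bound.
pub fn ndv_bounds(shuffle_rows: usize, partition_ndvs: &[usize]) -> NdvBounds {
    let max_partition_ndv = partition_ndvs.iter().copied().max().unwrap_or(0);
    let upper = shuffle_rows;
    // Guard against inconsistent statistics so that lower <= upper always holds.
    let lower = max_partition_ndv.min(upper);
    NdvBounds { lower, upper }
}

/// Clamp an NDV estimate into the given bounds.
pub fn clamp_estimate(estimate: usize, bounds: NdvBounds) -> usize {
    // `clamp` panics if lower > upper; `ndv_bounds` rules that out.
    estimate.clamp(bounds.lower, bounds.upper)
}
```

For example, with 1000 shuffle rows and per-partition NDVs of 40, 25, and 10, the bounds would be 40 and 1000, so an estimate of 5000 would be clamped down to 1000 and an estimate of 5 raised to 40.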
