buraksenn opened a new pull request, #20926: URL: https://github.com/apache/datafusion/pull/20926
## Which issue does this PR close?

Part of https://github.com/apache/datafusion/issues/20766

## Rationale for this change

Grouped aggregations currently estimate output rows as `input_rows`, ignoring available NDV (number of distinct values) statistics. Spark's `AggregateEstimation` and Trino's `AggregationStatsRule` both use NDV products to tighten this estimate. This PR draws heavily on both:

- [Spark reference](https://github.com/apache/spark/blob/e8d8e6a8d040d26aae9571e968e0c64bda0875dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AggregateEstimation.scala#L38-L61)
- [Trino reference](https://github.com/trinodb/trino/blob/43c8c3ba8bff814697c5926149ce13b9532f030b/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java#L92-L101)

## What changes are included in this PR?

- Estimate aggregate output rows as `min(input_rows, product(NDV_i + null_adj_i) * grouping_sets)`
- Cap the estimate by the Top K limit when one is active, since the output row count cannot exceed K
- Propagate `distinct_count` from child stats to group-by output columns

## Are these changes tested?

Yes, by existing tests and by new tests that cover different scenarios and edge cases.

## Are there any user-facing changes?

No
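
For readers who want to see the row estimate from "What changes are included in this PR?" spelled out, here is a minimal standalone sketch. It is not the code in this PR: `GroupColStats` and `estimate_group_rows` are hypothetical names, the null adjustment is assumed to be one extra group when a column is known to contain nulls, and the fallback to `input_rows` when an NDV is missing is an assumption based on the previous behaviour described above.

```rust
/// Hypothetical per-column statistics for illustration only;
/// this is not DataFusion's actual statistics type.
#[derive(Clone, Copy)]
struct GroupColStats {
    /// Number of distinct non-null values (NDV), if known.
    distinct_count: Option<u64>,
    /// Number of NULL values, if known.
    null_count: Option<u64>,
}

/// Sketch of: min(input_rows, product(NDV_i + null_adj_i) * grouping_sets),
/// optionally capped by a Top K limit.
fn estimate_group_rows(
    input_rows: u64,
    cols: &[GroupColStats],
    grouping_sets: u64,
    top_k: Option<u64>,
) -> u64 {
    // `None` means at least one group-by column has no NDV available.
    let mut product: Option<u64> = Some(1);
    for c in cols {
        product = match (product, c.distinct_count) {
            (Some(p), Some(ndv)) => {
                // Assumption: NULL forms one extra group when nulls are present.
                let null_adj = if c.null_count.unwrap_or(0) > 0 { 1 } else { 0 };
                // Saturating arithmetic avoids overflow on large NDV products.
                Some(p.saturating_mul(ndv.saturating_add(null_adj)))
            }
            _ => None,
        };
    }

    // Assumption: without NDV for every column, keep the old input_rows estimate.
    let estimate = match product {
        Some(p) => input_rows.min(p.saturating_mul(grouping_sets)),
        None => input_rows,
    };

    // The output can never contain more rows than the Top K limit.
    match top_k {
        Some(k) => estimate.min(k),
        None => estimate,
    }
}

fn main() {
    let cols = [
        GroupColStats { distinct_count: Some(10), null_count: Some(0) },
        GroupColStats { distinct_count: Some(4), null_count: Some(3) },
    ];
    // 10 * (4 + 1) = 50 groups, well below the 1000 input rows.
    println!("{}", estimate_group_rows(1_000, &cols, 1, None));
}
```

With these inputs the NDV product (50) replaces the old estimate of 1000 rows; an input of only 30 rows, or an active Top K of 20, would cap the estimate further.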
