jonathanc-n opened a new issue, #20766: URL: https://github.com/apache/datafusion/issues/20766
### Is your feature request related to a problem or challenge?

#15265 points out that we do not really use NDVs or `distinct_count` anywhere in the code. The statistic will become more practical once #19957 is merged. An example of an optimization that uses `distinct_count` is in https://github.com/apache/datafusion/pull/20731.

### Describe the solution you'd like

I looked into Trino and Spark and put together a list of optimizations that can be made with this statistic:

- Changing equality filter selectivity from `1/(max - min + 1)` to `1/distinct_count`
- Semi/anti join selectivity calculations
- Multi-join column selectivity with decay
- Choosing a hash key with high NDV for better spread (maybe rejecting low-NDV columns to avoid skew); very good for distributed DataFusion
- Possibly checking at runtime whether partial aggregation reduces the input enough to be useful; Trino uses the formula `NDV x 2 > input_rows`
- Top-k output cardinality estimates
- `count(distinct col)` can use `distinct_count`

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
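To make two of the listed items concrete, here is a minimal Rust sketch of the two equality-filter selectivity formulas and of the `NDV x 2 > input_rows` comparison. The function names are hypothetical and are not DataFusion, Trino, or Spark APIs. The sketch assumes integer column bounds, and the reading of the Trino formula (a true result means there are so many groups that partial aggregation would shrink the input by less than half) is an assumption.

```rust
/// Equality-filter selectivity estimated from the value range:
/// 1 / (max - min + 1). Returns None for an empty range (max < min).
/// Hypothetical helper, assuming integer-typed column bounds.
fn selectivity_from_range(min: i64, max: i64) -> Option<f64> {
    if max < min {
        return None;
    }
    // Widen to i128 so that max - min + 1 cannot overflow.
    let width = (max as i128) - (min as i128) + 1;
    Some(1.0 / width as f64)
}

/// Equality-filter selectivity estimated from the distinct count:
/// 1 / distinct_count. Returns None when the count is zero.
fn selectivity_from_ndv(distinct_count: u64) -> Option<f64> {
    if distinct_count == 0 {
        None
    } else {
        Some(1.0 / distinct_count as f64)
    }
}

/// The comparison quoted in the list: NDV x 2 > input_rows.
/// Assumed reading: `true` means the number of groups is more than half
/// the number of rows, so partial aggregation would reduce little.
fn ndv_times_two_exceeds_rows(distinct_count: u64, input_rows: u64) -> bool {
    distinct_count.saturating_mul(2) > input_rows
}

fn main() {
    // A column whose values span [1, 1_000_000] but hold only 100 distinct values.
    let by_range = selectivity_from_range(1, 1_000_000).unwrap();
    let by_ndv = selectivity_from_ndv(100).unwrap();
    println!("range-based selectivity: {by_range:e}");
    println!("NDV-based selectivity:   {by_ndv:e}");

    // 900 groups in 1_000 rows versus 10 groups in 1_000 rows.
    println!("many groups: {}", ndv_times_two_exceeds_rows(900, 1_000));
    println!("few groups:  {}", ndv_times_two_exceeds_rows(10, 1_000));
}
```

For a sparse column like the one in `main`, the range-based estimate is several orders of magnitude smaller than the NDV-based one, which is the gap the first list item is about.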
