[I] EPIC: Making use of NDVs (number of distinct values) in DataFusion [datafusion]

via GitHub Fri, 06 Mar 2026 17:22:14 -0800


jonathanc-n opened a new issue, #20766:
URL: https://github.com/apache/datafusion/issues/20766


   ### Is your feature request related to a problem or challenge?
   
   #15265 points out that we do not really use NDVs or `distinct_count` 
anywhere in the code. Using them will become more practical once #19957 is 
merged. 
   
   An example of an optimization that uses `distinct_count` is in this PR: 
https://github.com/apache/datafusion/pull/20731 
   
   ### Describe the solution you'd like
   
   I looked into Trino/Spark and put together a list of optimizations that 
this statistic enables:
    - Changing equality filter selectivity from `1/(max - min + 1)` to 
`1/distinct_count`
    - Semi/anti join selectivity calculations
    - Multi-join column selectivity with decay
    - Choosing a hash key with high NDV for better spread (maybe rejecting 
low-NDV columns to avoid skew); very useful for distributed DataFusion
    - Possibly checking at runtime whether partial aggregation reduces the 
input enough to be worthwhile; Trino uses the formula `NDV x 2 > input_rows`
    - Top-k output cardinality estimates
    - `count(distinct col)` can use `distinct_count`
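
To make the first and fifth items concrete, here is a minimal sketch of the two formulas. It is not DataFusion's actual API: `ColumnStats`, `equality_selectivity`, and `ndv_times_two_exceeds_rows` are hypothetical names, and the column is assumed to hold integers. The second function only evaluates the quoted predicate `NDV x 2 > input_rows`; whether a true result should enable or disable partial aggregation is left to the caller.

```rust
/// Hypothetical, simplified column statistics (not DataFusion's real types).
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
    /// Number of distinct values (NDV), if known.
    pub distinct_count: Option<u64>,
}

/// Estimated selectivity of `col = literal`.
/// Prefers `1/distinct_count`; falls back to `1/(max - min + 1)` when no
/// usable NDV is available.
pub fn equality_selectivity(stats: &ColumnStats) -> f64 {
    match stats.distinct_count {
        Some(ndv) if ndv > 0 => 1.0 / ndv as f64,
        _ => {
            // Range-based fallback: assumes every integer in [min, max] is
            // equally likely. i128 avoids overflow on extreme bounds.
            let width = (stats.max as i128 - stats.min as i128 + 1).max(1);
            1.0 / width as f64
        }
    }
}

/// Evaluates the predicate quoted above, `NDV x 2 > input_rows`.
/// saturating_mul keeps the check well-defined for very large NDVs.
pub fn ndv_times_two_exceeds_rows(ndv: u64, input_rows: u64) -> bool {
    ndv.saturating_mul(2) > input_rows
}
```

For a column with `min = 1`, `max = 1_000_000`, and 50 distinct values, the NDV-based estimate is 1/50, while the range-based fallback gives 1/1,000,000, which would badly underestimate the filter's output.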
    
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

