[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621143#comment-16621143
 ] 

Balazs Jeszenszky commented on IMPALA-7310:
-------------------------------------------

[~Paul.Rogers] this one is about the fact that compute stats will come up with 
a 0 NDV for an all-NULL column of non-zero cardinality. As a nice to have, we 
could make cardinality estimation more robust against columns with 0 NDV, but 
post this fix, that value would either be valid or manually set.
The rest sounds like valid improvement (it's probably that way because no one 
made it better yet), but would need their own jiras.

FWIW, IMPALA-7528 is also somewhat related, if you want to fix these in one go.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -----------------------------------------------------------------------------
>
>                 Key: IMPALA-7310
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7310
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>            Reporter: Zsombor Fedor
>            Assignee: Paul Rogers
>            Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to