[ https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622712#comment-16622712 ]
Paul Rogers edited comment on IMPALA-7310 at 9/20/18 9:08 PM: -------------------------------------------------------------- Per the suggestion of [~jeszyb], created IMPALA-7601 to describe the general issue, allowing this ticket to focus on the specific issue of handling a column full of NULLS when doing join planning. The general solution can solve this, or we can also solve this ticket with a specific fix for this one case. was (Author: paul.rogers): Per the suggestion of [~jeszyb], created IMPALA-7601 to describe the general issue, allowing this ticket to focus on the specific issue of handling a column full of NULLS. The general solution can solve this, or we can solve this with a specific fix for this one case. > Compute Stats not computing NULLs as a distinct value causing wrong estimates > ----------------------------------------------------------------------------- > > Key: IMPALA-7310 > URL: https://issues.apache.org/jira/browse/IMPALA-7310 > Project: IMPALA > Issue Type: Bug > Components: Frontend > Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, > Impala 2.11.0, Impala 3.0, Impala 2.12.0 > Reporter: Zsombor Fedor > Assignee: Paul Rogers > Priority: Major > > As seen in other DBMSs > {code:java} > NDV(col){code} > not counting NULL as a distinct value. The same also applies to > {code:java} > COUNT(DISTINCT col){code} > This is working as intended, but when computing column statistics it can > cause some anomalies (i.g. bad join order) as compute stats uses NDV() to > determine columns NDVs. > > For example when aggregating more columns, the estimated cardinality is > [counted as the product of the columns' number of distinct > values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669] > If there is a column full of NULLs the whole product will be 0. > > There are two possible fix for this. > Either we should count NULLs as a distinct value when Computing Stats in the > query: > {code:java} > SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 > as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code} > instead of > {code:java} > SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code} > > > Or we should change the planner > [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169] > to take care of this bug. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org