[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2020-09-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188602#comment-17188602
 ] 

ASF subversion and git services commented on IMPALA-7310:
-

Commit f4273a40fe2e2ab0806f0324ecf33a25b3df6279 in impala's branch 
refs/heads/master from Shant Hovsepian
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f4273a4 ]

IMPALA-7310: Partial fix for NDV cardinality with NULLs.

This fix just handles the case where a column's cardinality is zero
however it's nullable and we have null stats to indicate there are null
values, therefore we adjust the cardinality from 0 to 1.

The cardinality of zero was especially problematic when calculating
cardinalities for multiple predicates with multiplication. The 0 would
propagate up the plan tree and result in poor plan choices such as
always using broadcast joins where shuffle would've been more optimal.

Testing:
  * 26 Node TPC-DS 30TB run had better plans for Q4 and Q11
- Q4 172s -> 80s
- Q11 103s -> 77s
  * CardinalityTest
  * TpcdsPlannerTest

Change-Id: Iec967053b4991f8c67cde62adf003cbd3f429032
Reviewed-on: http://gerrit.cloudera.org:8080/16349
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2019-05-10 Thread Tim Armstrong (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837665#comment-16837665
 ] 

Tim Armstrong commented on IMPALA-7310:
---

WIP patch - https://gerrit.cloudera.org/#/c/11528/


> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2019-03-21 Thread Balazs Jeszenszky (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797992#comment-16797992
 ] 

Balazs Jeszenszky commented on IMPALA-7310:
---

[~Paul.Rogers] are you working on this?

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-24 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626483#comment-16626483
 ] 

Paul Rogers commented on IMPALA-7310:
-

Final solution is even simpler, since we don't want to change anything except 
null handling. If a table has stats, and if NDV <= 10, then add one to it to 
account for the fact that the planner defines NDV to include null values, while 
the NDV function (and thus stats) excludes null from the NDV.

Above 10, the difference of including null becomes minor, so to minimize the 
impact of the change, we phase out the adjustments once NDV gets above 10.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-20 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622856#comment-16622856
 ] 

Paul Rogers commented on IMPALA-7310:
-

The planner uses NDVs to make binary decisions: do I do x or y? (Do I put t1 on 
the build side of a join, or to I put it on the probe site?) In most cases, the 
values being compared are order-of-magnitude different, and so fine nuances of 
value are not important. We simply need some reasonable non-zero number so that 
the calcs can play out.

The simplest fix is to handle the non-stats case for a 
[{{ColumnStats}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java]
 instance.

The current code in {{initColStats()}} initializes NDV to -1 (undefined). 
Suggested alternatives:

 * If type is Boolean, NDV = 2
 * If type is TINYINT, NDV = 256.
 * If type is anything else, assume NDV = some constant, say 1000.

Note that the above logic actually already exists in 
{{createHiveColStatsData()}} where it is used to bound the NDV value. So, we 
just reuse it. That code also suggests we can NDV at row count. But, since our 
guesses are small, the row count might not add much value. Or, our NDV guess 
might be some fraction of row count, if we want to be fancy.

Then, add a flag (if some useful field does not exist) to indicate if the NDV 
is from stats or estimates. (Will be useful in computing selectivity later.)

Finally, in {{createHiveColStatsData}}, set a floor on NDV at 1 to account for 
the fact that an all-null column has HDV=0. Or, to be conservative, if NDV < 
10, add one to NDV to account for nulls.

Next, modify {{update()}} to use the defaults (to be set in {{initColStats()}} 
for the "incompatible" case.

As a result, when the plan nodes ask for NDV, they won't get a 0 value if we 
have no data, nor will they get 0 if a column is all nulls.

Add or modify unit tests to verify the above logic, especially the defaults 
case and how the defaults propagate up the plan tree.

The risk is that some plans will change. We hope they change to favor getting 
the correct plan more often. But, there will be some use case for which the 
old, wrong, values produced a more accurate plan than the new estimates. This 
is always a risk.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-20 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622712#comment-16622712
 ] 

Paul Rogers commented on IMPALA-7310:
-

Per the suggestion of [~jeszyb], created IMPALA-7601 to describe the general 
issue, allowing this ticket to focus on the specific issue of handling a column 
full of NULLS. The general solution can solve this, or we can solve this with a 
specific fix for this one case.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-20 Thread Tim Armstrong (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622239#comment-16622239
 ] 

Tim Armstrong commented on IMPALA-7310:
---

WRT to throwing away data, the philosophy in the Impala planner is generally 
not to try to hard to infer statistics if "proper" statistics are missing. E.g. 
we don't try to estimate row count from file size, so the planner can make 
"obviously dumb" decisions with join ordering in the absence of stats.

I think the motivation was to avoid investing too much effort (implementation 
and maintenance) to handle cases where best practices weren't followed, and 
also as a way to discourage bad practices.

I don't know if this is the right decision, but there's some sense to it.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-19 Thread Philip Zeyliger (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621472#comment-16621472
 ] 

Philip Zeyliger commented on IMPALA-7310:
-

I've not looked at how that test is hooked up, but it's entirely possible that 
constant folding isn't happening in the test but would be happening in real 
life for the constant expression.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621434#comment-16621434
 ] 

Paul Rogers commented on IMPALA-7310:
-

Odd. Looked at the tests in {{ExprNdvTest}}. We have tests such as:

{noformat}
verifyNdv("case when id = 1 then 'yes' else 'no' end", 2);
{noformat}

This says that we convert the rand of values in column {{id}} into two values: 
{{'yes'}} and {{'no'}}. So, the NDV of the expression is 2. Solid reasoning.

Then I added a new test using only constants:

{noformat}
verifyNdv("case when 0 = 1 then 'yes' else 'no' end", 2);
{noformat}

In this case, the NDV is reported as 2, when it should be 1 (the expression is 
constant.)

Yes, this is a nit and few users will create such SQL. Still, I wonder if this 
suggests other inconsistencies in NDV & selectivity handling?

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621222#comment-16621222
 ] 

Paul Rogers commented on IMPALA-7310:
-

[~jeszyb], agree completely. Here I'm digging down to do a root cause analysis. 
Once we know what's what, we can figure out how to divide work up into tickets, 
and which to do when. What I want to avoid is the whack-a-mole effect: solve 
part of the problem which just causes another problem elsewhere.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621196#comment-16621196
 ] 

Paul Rogers commented on IMPALA-7310:
-

Here, it is worth pointing out the risk of any change. The current model 
appears to not fit some use case (the one that triggered this ticket.) But, it 
may well fit some real-world use cases. The risk of this (or any) change is 
that we better fit the majority of use case, but we end up sub-optimizing for 
the occasional use case that happens to fit the current model.

Next step is to do some testing to see the impact of any change on existing 
tests. Anyone know of potential problems? Especially problems that led is to 
the current implementation of how we handle missing NDV values?

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621171#comment-16621171
 ] 

Paul Rogers commented on IMPALA-7310:
-

The original description pointed out the method that computes the compound NDV:

{noformat}
for (Expr expr: exprs) {
  if (expr.getNumDistinctValues() == -1) {
numDistinctValues = -1;
break;
  }
  numDistinctValues *= expr.getNumDistinctValues();
}
{noformat}

This logic says, essentially, "if I don't know the actual NDV for any one group 
of rows, then I don't know the NDV for the combination." This is literally 
true. But, we ultimately need to make a decision: using some information is 
better than using none.

Perhaps, when we don't know the NDV, we should assume one:

* If the NDV > 0, use it.
* If 0 and we have metadata, assume 1.
* If 0 and we have no metadata, assume 10 (see above.)

As suggested in a previous node, perhaps we can compute this number in each 
node (because each node has the most information about the best a-priority 
estimate for that expression/operator.) Then, we allow those a-priority 
estimates to bubble up the tree with no special handling.

If we want, we could flag each node with a the source of the info (the three 
cases above) to aid in debugging.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-19 Thread Balazs Jeszenszky (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621143#comment-16621143
 ] 

Balazs Jeszenszky commented on IMPALA-7310:
---

[~Paul.Rogers] this one is about the fact that compute stats will come up with 
a 0 NDV for an all-NULL column of non-zero cardinality. As a nice to have, we 
could make cardinality estimation more robust against columns with 0 NDV, but 
post this fix, that value would either be valid or manually set.
The rest sounds like valid improvement (it's probably that way because no one 
made it better yet), but would need their own jiras.

FWIW, IMPALA-7528 is also somewhat related, if you want to fix these in one go.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621126#comment-16621126
 ] 

Paul Rogers commented on IMPALA-7310:
-

As noted above, the code uses -1 as an "undefined" selectivity marker. We saw 
that, if NDV=0, we leave the value at "undefined". A quick read suggests we do 
special handling further up the expression tree. See 
{{PlanNode.coputeCombinedSelectify}}.

1. Remove conjuncts (expressions) with undefined selectivity.
 2. If no expressions remain, add one with {{DEFAULT_SELECTIVITY}}.

This means that if we have no metadata, we treat the following identically:
{noformat}
WHERE x = 10
WHERE x = 10 AND y = 20 AND z = 30
{noformat}
Both are assumed to have a 0.1 selectivity, when one might argue that the 
second should have 0.1^3 = 0.001 selectivity. (Actual code uses an exponential 
back-off, ignored here.)

Note also the somewhat over-generalize use of {{DEFAULT_SELECTIVITY}}. If x = 
10 has 0.1 selectivity, then does it make sense to assume that x < 10 has the 
same? Some DB's assume a larger number (0.5, say) for inequalities. This 
subtlety is lost in the 0.1 for everything approach.

Is this all this a bug or a feature?

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621096#comment-16621096
 ] 

Paul Rogers commented on IMPALA-7310:
-

Simplest case: a binary predicate. Current behavior in 
{{BinaryPredicate.analyzeImpl()}}:

{noformat}
  long distinctValues = slotRefRef.getRef().getNumDistinctValues();
  if (distinctValues > 0) {
selectivity_ = 1.0 / distinctValues;
selectivity_ = Math.max(0, Math.min(1, selectivity_));
  }
{noformat}

So, if NDV is 0 (all nulls), selectivity is left at its default (-1, which 
seems to mean "undefined".)

There are really two cases here:

1. We have stats. The 0 for NDV tells us that we either have 0 rows, or all 
rows are null. In this case, we can safely assume that the NDV-with-nulls is 1.
2. We have no stats. NDV is unknown. But, we can use our default estimate in 
{{Expr.DEFAULT_SELECTIVITY = 0.1}} to compute an a-priority NDV estimate of 
1/selectivity = 10. (Or, more directly, use {{DEFAULT_SELECTIVITY}} as the 
expression selectivity.)

To even consider such a change, we'd need to test the various paths. The only 
test I could find for expressions and NDV are {{ExprNdvTest}} which tests with 
and without stats, but not with an NDV=0 (as far as I can tell.)

Note that this test case uses a different definition for NDV than the 
"distinct-non-null-values" definition used by the {{NDV\(x)}} function:

{noformat}
// When else not specified, it is NULL, verify it is counted
verifyNdv("case when id = 1 then 'yes' end", 2);
{noformat}

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-09-18 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619968#comment-16619968
 ] 

Paul Rogers commented on IMPALA-7310:
-

Simple reproduction:

{noformat}
create table t1 (x int, y varchar);
insert into t1 values (1, null), (2, null), (3, null), (4, null);
select ndv(y) from t1;

++
| ndv(y) |
++
| 0  |
++
{noformat}

Reference to Impala's [null handling in {{COUNT 
DISTINCT}}|https://impala.apache.org/docs/build/html/topics/impala_distinct.html]:
 "NULL is not included as a distinct value in the count."

Conclusion: given the definition of "distinct" above, the {{ndv\(x)}} function 
appears defined as the "number of distinct non-null values".

There appear to be two other cases where the NDV value can be null:

* New table, no stats at all.
* Empty table.

Where else might the NDV come into play?

Like other databases, when using a {{GROUP BY}} query, Impala places all 
{{NULL}} values into a single group. This means that, when using the NDV value 
to estimate group count, a simple floor (say, 1) will not badly underestimate 
group counts.

All this suggests that [~joemcdonnell]'s suggestion of setting a floor in the 
planner is the best place to start.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Assignee: Paul Rogers
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-07-26 Thread Tim Armstrong (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558855#comment-16558855
 ] 

Tim Armstrong commented on IMPALA-7310:
---

Talked with [~joemcdonnell] - we think we should fix this in the planner using 
cardinality to set a floor on NDV.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

2018-07-17 Thread Tim Armstrong (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546796#comment-16546796
 ] 

Tim Armstrong commented on IMPALA-7310:
---

The right way to adjust the behaviour of the NDV computation would be to have a 
variant of NDV that includes NULL: 
https://github.com/apache/impala/blob/master/be/src/exprs/aggregate-functions-ir.cc#L1427

Probably the ideal solution is to compute a null count too and use that in the 
planner.

Another idea to work around the pathological behaviour is to set the NDV 
estimate to 1 if the NDV in stats is zero but the cardinality is > 0.

> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -
>
> Key: IMPALA-7310
> URL: https://issues.apache.org/jira/browse/IMPALA-7310
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, 
> Impala 2.11.0, Impala 3.0, Impala 2.12.0
>Reporter: Zsombor Fedor
>Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can 
> cause some anomalies (i.g. bad join order) as compute stats uses NDV() to 
> determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is 
> [counted as the product of the columns' number of distinct 
> values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669]
>  If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the 
> query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 
> as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner 
> [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169]
>  to take care of this bug.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org