Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17344 )

Change subject: IMPALA-10677: Set selectivity of Not-equal
......................................................................


Patch Set 4:

(7 comments)

Thanks you for working on this. Seems like NULL values are not handled 
correctly right now.

I think we should either use the number of NULLs from column stats and return 
correct results, or we should not set selectivity when the parameter is NULL.

Please note that the NOT DISTINCT FROM test also produces wrong values in case 
of NULLs.

http://gerrit.cloudera.org:8080/#/c/17344/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17344/4//COMMIT_MSG@7
PS4, Line 7: IMPALA-10677
You might want to use IMPALA-7560 as this was the original Jira and it contains 
more information.


http://gerrit.cloudera.org:8080/#/c/17344/4//COMMIT_MSG@11
PS4, Line 11: col = 5", but not "2 * col = 10
nit: wouldn't be better to use != in the example?

 "col != 5", but not "2 * col != 10"


http://gerrit.cloudera.org:8080/#/c/17344/3/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
File fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java:

http://gerrit.cloudera.org:8080/#/c/17344/3/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java@257
PS3, Line 257: distinctValues
> Okay. Thanks for the verification. We could make an adjustment here if ther
SlotRef retrieves the NDV from the SlotDescriptor's ColumnStats in 
SlotRef.adjustNumDistinctValues() .

ColumnStats also has NumNulls which could be used here:

https://github.com/apache/impala/blob/954eb5c85d329af7690698cdc4d0f409260e6d18/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java#L221


http://gerrit.cloudera.org:8080/#/c/17344/4/fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java:

http://gerrit.cloudera.org:8080/#/c/17344/4/fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java@a372
PS4, Line 372:
That seems wrong looking at the data. I think it should be 1.


http://gerrit.cloudera.org:8080/#/c/17344/4/fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java@344
PS4, Line 344: 7300
NULL values are not included in the NDV, also null count is stored separately 
in the column stats.
I think we should handle NULLs specially, as mentioned in the above comment.

I.e. we should have the same values here that are commented out in the original 
test case.In this case 1.0, as there are no NULL values for 'id' in the 
alltypes table.


http://gerrit.cloudera.org:8080/#/c/17344/4/fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java@355
PS4, Line 355: 1 - 1.0/1
I think the expected value is 0 here, as null_str is all nulls.


http://gerrit.cloudera.org:8080/#/c/17344/4/fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java@359
PS4, Line 359: some_nulls is not distinct from 'foo'",
I think it'd be worth to add a test case with

 some_nulls is distinct from null

I think the expected value would be "6.0 / 26.0", i.e. (NumRows - NumNulls) / 
(NumRows)



--
To view, visit http://gerrit.cloudera.org:8080/17344
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icd6f5945840ea2a8194d72aa440ddfa6915cbb3a
Gerrit-Change-Number: 17344
Gerrit-PatchSet: 4
Gerrit-Owner: liuyao <liu...@sensorsdata.cn>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: liuyao <liu...@sensorsdata.cn>
Gerrit-Comment-Date: Fri, 11 Jun 2021 15:36:03 +0000
Gerrit-HasComments: Yes

Reply via email to