Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16226 )

Change subject: IMPALA-9942: DataSketches HLL shouldn't take empty strings as 
distinct values
......................................................................


Patch Set 2:

(7 comments)

Thanks for taking care of this! Left some comments during my initial read 
through.

http://gerrit.cloudera.org:8080/#/c/16226/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16226/2//COMMIT_MSG@9
PS2, Line 9: On Hive DataSketches HLL empty string did not count to the end 
result
This part of the sentence was very unpleasant for me to read. Could you 
rephrase it a bit? Just one suggestion:
In Hive empty strings doesn't count as separate values when querying 
count(distinct) estimates using Apache DataSketches HLL algorithm


http://gerrit.cloudera.org:8080/#/c/16226/2//COMMIT_MSG@10
PS2, Line 10: for compatibility
nit: I'd make this a separate sentence


http://gerrit.cloudera.org:8080/#/c/16226/2//COMMIT_MSG@10
PS2, Line 10: shake
nit: sake


http://gerrit.cloudera.org:8080/#/c/16226/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16226/2/be/src/exprs/aggregate-functions-ir.cc@1651
PS2, Line 1651: || dst->len == 0
Is this some leftover?
dst->len is never zero in this function, see L1653.


http://gerrit.cloudera.org:8080/#/c/16226/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16226/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@171
PS2, Line 171: create table emptyString (s string, v varchar(1));
Could you check how much this test adds to the runtime of the full test suite? 
It's fine to create a table and populate it in the tests but it would be faster 
in my opinion to find a table in functional_parquet that contains empty strings 
for this purpose.
I see nulltable and nullrows to have some columns with empty strings. Could you 
give them a try to check if that approach is faster?


http://gerrit.cloudera.org:8080/#/c/16226/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@180
PS2, Line 180: ds_hll_estimate(ds_hll_sketch(s)),
Could you please add the same check for ds_hll_sketch_and_estimate() ?


http://gerrit.cloudera.org:8080/#/c/16226/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@188
PS2, Line 188: ---- QUERY
             : drop table emptyString
             : ====
No need for this section. This test runs on a unique database and the whole 
database will be dropped once this test finishes.
Did you see this table being left somewhere or were you just precautious?



--
To view, visit http://gerrit.cloudera.org:8080/16226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie7648217bbe2f66b817788f131c062f349b1e9ad
Gerrit-Change-Number: 16226
Gerrit-PatchSet: 2
Gerrit-Owner: Adam Tamas <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Comment-Date: Tue, 21 Jul 2020 13:59:27 +0000
Gerrit-HasComments: Yes

Reply via email to