Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16226 )
Change subject: IMPALA-9942: DataSketches HLL shouldn't take empty strings as distinct values ...................................................................... Patch Set 2: (7 comments) Thanks for taking care of this! Left some comments during my initial read through. http://gerrit.cloudera.org:8080/#/c/16226/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16226/2//COMMIT_MSG@9 PS2, Line 9: On Hive DataSketches HLL empty string did not count to the end result This part of the sentence was very unpleasant for me to read. Could you rephrase it a bit? Just one suggestion: In Hive empty strings doesn't count as separate values when querying count(distinct) estimates using Apache DataSketches HLL algorithm http://gerrit.cloudera.org:8080/#/c/16226/2//COMMIT_MSG@10 PS2, Line 10: for compatibility nit: I'd make this a separate sentence http://gerrit.cloudera.org:8080/#/c/16226/2//COMMIT_MSG@10 PS2, Line 10: shake nit: sake http://gerrit.cloudera.org:8080/#/c/16226/2/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16226/2/be/src/exprs/aggregate-functions-ir.cc@1651 PS2, Line 1651: || dst->len == 0 Is this some leftover? dst->len is never zero in this function, see L1653. http://gerrit.cloudera.org:8080/#/c/16226/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test: http://gerrit.cloudera.org:8080/#/c/16226/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@171 PS2, Line 171: create table emptyString (s string, v varchar(1)); Could you check how much this test adds to the runtime of the full test suite? It's fine to create a table and populate it in the tests but it would be faster in my opinion to find a table in functional_parquet that contains empty strings for this purpose. I see nulltable and nullrows to have some columns with empty strings. Could you give them a try to check if that approach is faster? http://gerrit.cloudera.org:8080/#/c/16226/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@180 PS2, Line 180: ds_hll_estimate(ds_hll_sketch(s)), Could you please add the same check for ds_hll_sketch_and_estimate() ? http://gerrit.cloudera.org:8080/#/c/16226/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@188 PS2, Line 188: ---- QUERY : drop table emptyString : ==== No need for this section. This test runs on a unique database and the whole database will be dropped once this test finishes. Did you see this table being left somewhere or were you just precautious? -- To view, visit http://gerrit.cloudera.org:8080/16226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie7648217bbe2f66b817788f131c062f349b1e9ad Gerrit-Change-Number: 16226 Gerrit-PatchSet: 2 Gerrit-Owner: Adam Tamas <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Comment-Date: Tue, 21 Jul 2020 13:59:27 +0000 Gerrit-HasComments: Yes
