Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )
Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. ...................................................................... Patch Set 3: (14 comments) http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1443 PS3, Line 1443: bucket_ I don't understand why we have "bucket" as a prefix here. Both the row_count and hll_state are per bucket, right? http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1456 PS3, Line 1456: existing issues with passing constant arguments to all : /// aggregation phases what does that mean? is there a JIRA we can just reference? http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1469 PS3, Line 1469: int64_t* GetCountPtr(int bucket_idx) { : return reinterpret_cast<int64_t*>(GetBucketPtr(bucket_idx)); : } : : uint8_t* GetHllPtr(int bucket_idx) { : return GetBucketPtr(bucket_idx) + sizeof(int64_t); : } : : uint8_t* GetBucketPtr(int bucket_idx) { : return reinterpret_cast<uint8_t*>(this) + : sizeof(SampledNdvState) + bucket_idx * BUCKET_SIZE; : } given that NUM_HLL_BUCKETS is predetermined, why not just use an array: struct { int64_t row_count, uint8_t hll[HLL_LEN] } buckets[NUM_HLL_BUCKETS]; http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1500 PS3, Line 1500: void AggregateFunctions::SampledNdvUpdate(FunctionContext* ctx, const T& src, I think a quick comment explaining what we're doing here would be helpful, something like Incorporate the row into one of the intermediate HLLs, which will be used by Finalize to generate a set of the (x,y) points. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1503 PS3, Line 1503: state->total_row_count % SampledNdvState::NUM_HLL_BUCKETS; I wonder if picking a random bucket is better? We talked a bit about this. Up to you whether you think it's worth investigating. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1533 PS3, Line 1533: int64_t counts[num_points]; : memset(counts, 0, num_points * sizeof(int64_t)); how about: int64_t counts[num_points] = { 0 }; http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1542 PS3, Line 1542: diminishing returns from creating additional data points not sure what that means. do you mean additional to 'num_points'? i.e. are you saying that empirically, 'num_points' is a sufficient number of points? http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h File be/src/exprs/aggregate-functions.h: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h@209 PS3, Line 209: x/y I think using a notation like "(x, y)" maybe clearer since this sounds like divide. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h File be/src/util/mpfit-util.h: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h@74 PS3, Line 74: /// Human-readable name of this function. Used for debugging. should any of these members be private? http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc File be/src/util/mpfit-util.cc: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33 PS3, Line 33: function but what is this function suppose to do? Does it make sense to call it a "fitting function"? http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java: http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@519 PS3, Line 519: children_.get(1) samplePerc? http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@524 PS3, Line 524: children_.get(1) samplePerc? why don't we have to do anything with the return value of uncheckedCastTo() (like set it as the children_ at index 1)? http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py File tests/query_test/test_aggregation.py: http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@337 PS3, Line 337: * sample_perc even with the comment, I'm not sure what's going on here. http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@339 PS3, Line 339: def __appx_equals(self, a, b, diff_perc): a quick comment for this would be helpful. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 3 Gerrit-Owner: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Jim Apple <jbapple-imp...@apache.org> Gerrit-Comment-Date: Sat, 02 Dec 2017 00:20:45 +0000 Gerrit-HasComments: Yes