Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/11100 )
Change subject: Implement BloomFilter Predicate in server side. ...................................................................... Patch Set 9: (15 comments) http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate-test.cc File src/kudu/common/column_predicate-test.cc: http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate-test.cc@881 PS9, Line 881: InList Should this be 'Equality'? http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.h File src/kudu/common/column_predicate.h: http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.h@230 PS9, Line 230: bool EvaluateCellForBloomFilter(DataType type, const void* cell) const; Is this used anywhere? If it's test only please indicate that in the doc. http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.h@376 PS9, Line 376: typedef typename DataTypeTraits<PhysicalType>::cpp_type cpp_type; : size_t size = sizeof(cpp_type); : const void* data = cell; : if (PhysicalType == BINARY) { : const Slice *slice = reinterpret_cast<const Slice *>(cell); : size = slice->size(); : data = slice->data(); : } : Slice cell_slice(reinterpret_cast<const uint8_t *>(data), size); : BloomKeyProbe probe(cell_slice, bf.hash_algorithm()); This entire portion could be hoisted outside the for loop, which saves recomputing the probe in the case of multiple filters. http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.h@386 PS9, Line 386: if (!BloomFilter(bf.bloom_data(), bf.nhash()).MayContainKey(probe)) { Once this for loop is simplified to just be this call, it should further simplify to just use std::all_of http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.cc File src/kudu/common/column_predicate.cc: http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.cc@115 PS9, Line 115: return ColumnPredicate(PredicateType::InBloomFilter, move(column), bfs, lower, upper); I think this should be calling Simplify(), or if there's a reason not to please leave a comment. http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.cc@385 PS9, Line 385: bloom_filters_.insert(bloom_filters_.end(), other.bloom_filters().begin(), Pretty sure this can be simplified to a straight copy, eg: bloom_filters_ = other.bloom_filters_; http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.cc@387 PS9, Line 387: if (other.lower_ != nullptr && This range portion of this code doesn't need to be duplicated if you move InBloomFilter above Range, and let it fall-through. Make sure to annotate with FALLTHROUGH_INTENDED, though. http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.cc@588 PS9, Line 588: if (new_values.empty()) { Simplify() should already handle this, so I think you can reduce this to just the else clause. http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.cc@644 PS9, Line 644: new_values.emplace_back(value); can this use copy_if like above? I think it's more elegant that way. http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.cc@648 PS9, Line 648: SetToNone(); likewise, I think this is already handled by the Simplify() call. http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.cc@661 PS9, Line 661: if (other.lower_ != nullptr && likewise I think this can be simplified with the fallthrough trick. http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.cc@895 PS9, Line 895: case PredicateType::InBloomFilter: { This could be simplified by moving it above Range and allowing it to fall through to check the range portion (after checking the annotations). Annotate with FALLTHROUGH_INTENDED if you do that. http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/column_predicate.cc@899 PS9, Line 899: auto bound_equal = [&] (const void* eleml, const void* elemr) { This is nice, consider doing the same simplification for Range. http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/common.proto File src/kudu/common/common.proto: http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/common.proto@347 PS9, Line 347: enum HashAlgorithmInBloomFilter { I suggested this back in PS5 but I think there was some confusion. I think you can call this just 'HashAlgorithm', and use it for both bloom filters as well as in HashBucketSchemaPB (where the existing one will be removed). To do that compatibly the existing HashAlgorithm values have to remain, so it should be: HashAlgorithm { UNKNOWN = 0; MURMUR_HASH_2 = 1; CITY_HASH = 2; } http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/wire_protocol.cc File src/kudu/common/wire_protocol.cc: http://gerrit.cloudera.org:8080/#/c/11100/9/src/kudu/common/wire_protocol.cc@604 PS9, Line 604: if (!bf.has_nhash() || !bf.has_bloom_data() || !bf.has_hash_algorithm()) { This should also be checking for BloomFilter::UNKNOWN when that variant is added. -- To view, visit http://gerrit.cloudera.org:8080/11100 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62c2de42667d0255d94e19db773240f7f9ee636c Gerrit-Change-Number: 11100 Gerrit-PatchSet: 9 Gerrit-Owner: ZhangYao <triplesheep0...@gmail.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Dan Burkert <danburk...@apache.org> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: ZhangYao <triplesheep0...@gmail.com> Gerrit-Comment-Date: Fri, 05 Oct 2018 05:08:37 +0000 Gerrit-HasComments: Yes