Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11100 )
Change subject: Implement BloomFilter Predicate in server side. ...................................................................... Patch Set 6: (68 comments) Looks good, most of my comments are stylistic. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc File src/kudu/common/column_predicate-test.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@45 PS6, Line 45: static const int kRandomSeed = 0xdeadbeef; Can you use SeedRandom() instead? That way we'll use a different seed for every run of the test and get additional test coverage while retaining the ability to rerun a test with a particular seed using --test_random_seed. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@77 PS6, Line 77: BloomFilterBuilder *bfb1, : BloomFilterBuilder *bfb2) { Nit: BloomFilterBuilder* bfb1 http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@79 PS6, Line 79: srandom(random_seed); Don't need if you use SeedRandom(). http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@83 PS6, Line 83: uint64_t key = random(); Use util/Random instead. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@88 PS6, Line 88: const uint8_t * const uint8_t* (the theme here is that when using a pointer type, the asterisk should be adjacent to the type rather than separated by a space). http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@92 PS6, Line 92: values->push_back(key); emplace_back in new code. Elsewhere too. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@791 PS6, Line 791: std::vector<ColumnPredicate::BloomFilterInner>* bf, Don't need std:: prefix. Below too. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@793 PS6, Line 793: std::vector<ColumnPredicate::BloomFilterInner> hold = *bf; Nit: maybe name "orig_bloom_filters" to better convey its purpose? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@828 PS6, Line 828: std::vector<ColumnPredicate::BloomFilterInner> bf_bak = *bf; bf_copy maybe? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1382 PS6, Line 1382: int n_keys = 5; // 0 1both hit bf1 and bf2, 2 3 4 only hit bf2. 0 1 both hit http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1414 PS6, Line 1414: // 0 1both hit bf1 and bf2, 2 3 4 only hit bf2. 0 1 both hit http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1429 PS6, Line 1429: // Test for BINARY type Each of these "Test for ... type" is independent of the other tests, right? Wouldn't it be cleaner to declare them each as a separate TEST_F? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h File src/kudu/common/column_predicate.h: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@275 PS6, Line 275: : const Slice& bloom_data() const { : return bloom_data_; : } : : size_t nhash() const { : return nhash_; : } : : HashAlgorithmInBloomFilter hash_algorithm() const { : return hash_algorithm_; : } : : void set_nhash(size_t nhash) { : nhash_ = nhash; : } : : void set_bloom_data(Slice bloom_data) { : bloom_data_ = bloom_data; : } : : void set_hash_algorithm(HashAlgorithmInBloomFilter hash_algorithm) { : hash_algorithm_ = hash_algorithm; : } I don't see the point of these getters and setters since the members are all public to begin with. Why not just access the members directly? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@301 PS6, Line 301: return (bloom_data_ == other.bloom_data_ && nhash_ == other.nhash_ : && hash_algorithm_ == other.hash_algorithm_); Nit: reformat like: return (bloom_data_ == other.bloom_data_ && nhash_ == other.nhash_ && hash_algorithm_ == other.hash_algorithm_); That way it's less noisy to add new members. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@304 PS6, Line 304: // The slice of bloom filter. Nit: separate each member from the member before it (or from operator==) with an empty line. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@308 PS6, Line 308: , currently support : // murmur2 and cityhash. I don't think this comment is necessary; the list of supported algorithms is embodied by the set of enum values. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@354 PS6, Line 354: const ColumnPredicate & Nit: const ColumnPredicate& http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@425 PS6, Line 425: // The vector of bloom filter in this predicate. Nit: "The list of bloom filters in this predicate." http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc File src/kudu/common/column_predicate.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@68 PS6, Line 68: : predicate_type_(predicate_type), : column_(move(column)), : lower_(lower), : upper_(upper) { Nit: use the same indentation as in the other two constructors. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@113 PS6, Line 113: CHECK(!bfs->empty()); Should check that bfs != nullptr. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@189 PS6, Line 189: bloom_filters_.clear(); Hmm, why don't we clear values_ in this function? Why is it necessary to clear bloom_filters_ but not values_? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@226 PS6, Line 226: predicate_type_ = PredicateType::None; : lower_ = nullptr; : upper_ = nullptr; Could use SetToNone here. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@254 PS6, Line 254: case PredicateType::InBloomFilter: { If we allowed an empty list of bloom filters, could that simplify into something else? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@264 PS6, Line 264: if (CheckValueInBloomFilter(lower_)) { : predicate_type_ = PredicateType::Equality; : upper_ = nullptr; : bloom_filters_.clear(); : } else { : SetToNone(); : } Can you explain the rationale behind this simplification? It's not obvious to me. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@275 PS6, Line 275: upper_ = nullptr; Isn't upper_ guaranteed to be nullptr in this branch? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@277 PS6, Line 277: if (CheckValueInBloomFilter(lower_)) { : predicate_type_ = PredicateType::Equality; : upper_ = nullptr; : bloom_filters_.clear(); : } else { : SetToNone(); : } Also not understanding this simplification; could you explain in a comment? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@287 PS6, Line 287: SetToNone(); Why is this case different from the L274-L275 case? That is, why do we use SetToNone() here but not there? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@443 PS6, Line 443: if (!other.CheckValueInBloomFilter(lower_)) { : SetToNone(); : } Can you explain this? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@616 PS6, Line 616: // Merge the optional lower and upper bound. : if (other.lower_ != nullptr && : (lower_ == nullptr || column_.type_info()->Compare(lower_, other.lower_) < 0)) { : lower_ = other.lower_; : } : if (other.upper_ != nullptr && : (upper_ == nullptr || column_.type_info()->Compare(upper_, other.upper_) > 0)) { : upper_ = other.upper_; : } This pattern seems to repeat itself quite a bit in this class. Could you decompose it into one or two helper functions in an anonymous namespace? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@630 PS6, Line 630: // value falls in list so change to Equality predicate Nit: "bloom filter", not "list" http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@636 PS6, Line 636: in list in bloom filter http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@647 PS6, Line 647: std::vector<const void *> new_values; Nit: vector<const void*> new_values; (no need for std:: prefix either). http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@650 PS6, Line 650: new_values.push_back(value); Nit: emplace_back in new code. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@870 PS6, Line 870: return strings::Substitute("`$0` IS BloomFilter", column_.name()); IS IN BloomFilter, perhaps? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto File src/kudu/common/common.proto: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@357 PS6, Line 357: // The bloomfilter for quick filter the row. What does "quick filter the row" mean? Could you elaborate? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@362 PS6, Line 362: // And We currently use CityHash backend. Doesn't this depend on the value of hash_algorithm? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@365 PS6, Line 365: optional HashAlgorithmInBloomFilter hash_algorithm = 3; Does this need a default value specifier for CITY_HASH? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@377 PS6, Line 377: We will eventually add a special : // predicate type for null-ness. Nit: not your fault, but could you remove this sentence? We've since added IsNotNull and IsNull predicates. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@404 PS6, Line 404: repeated BloomFilter bloom_filters = 1; Nit: add a comment explaining what this is. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@406 PS6, Line 406: // Lower and Upper is optional for InBloomFilter. How do lower and upper work with bloom filters? Can you doc in a comment? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@408 PS6, Line 408: optional bytes lower = 3 [(kudu.REDACT) = true]; Use field id 2 here and 3 for 'upper'; no reason to skip 2 since there wasn't a field 2 to begin with. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/key_util.cc File src/kudu/common/key_util.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/key_util.cc@234 PS6, Line 234: case PredicateType::InBloomFilter: // Upper in InBloomFilter process as upper in Range. Nit: "upper in InBloomFilter processed as upper in Range" Below too (with 'lower' instead of 'upper') http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc File src/kudu/common/wire_protocol-test.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@44 PS6, Line 44: #include "kudu/util/bloom_filter.h" Not sorted correctly. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@508 PS6, Line 508: Arena arena(1024); WireProtocolTest already has an arena, could reuse it? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@536 PS6, Line 536: ASSERT_NO_FATAL_FAILURE(ColumnPredicateToPB(ibf, &pb)); NO_FATALS in new code. Below too. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@555 PS6, Line 555: TEST_F(WireProtocolTest, TestColumnPredicateBloomFilterWithBound) { The setup code is identical in both tests; can you decompose it into a new test fixture? Something like: class BFWireProtocolTest : public WireProtocolTest { <members or functions for these two tests> } TEST_F(BFWireProtocolTest, Blah) { ... } http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc File src/kudu/common/wire_protocol.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@416 PS6, Line 416: const void* src; : size_t size; : src = bf_src.bloom_data().data(); : size = bf_src.bloom_data().size(); Combine these four lines into two (i.e. combine declaration and assignment). http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@455 PS6, Line 455: // Extract BloomFilterInner for bloom data for ColumnBloomFilterPredicate. from http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@461 PS6, Line 461: uint8_t* data_copy = static_cast<uint8_t*>(arena->AllocateBytes(bf_src.bloom_data().size())); bf_src.bloom_data().size() is repeated three times; can you store in a local variable with a shorter name? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@463 PS6, Line 463: const Slice* slice = arena->NewObject<Slice>(data_copy, bf_src.bloom_data().size()); Why do we need to allocate the slice itself from the arena? Seems like we turn around and immediately make a copy of it on L464, so couldn't we just declare a local slice for 'data_copy'? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@511 PS6, Line 511: Nit: unnecessary space here. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@514 PS6, Line 514: ColumnPredicatePB::BloomFilter *bloom_filter Nit: ColumnPredicatePB::BloomFilter* bloom_filter http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@595 PS6, Line 595: *predicate = ColumnPredicate::IsNull(col); : break; Nit: while you're here, could you fix this indentation? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@601 PS6, Line 601: < 1 Nit: == 0 is more clear http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@603 PS6, Line 603: on bloom filter contained Nit: no bloom filters http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@607 PS6, Line 607: bloom filter Nit: "in bloom filter" http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@608 PS6, Line 608: no nhash or bloom filter or hash algorithm Nit: "missing bloom filter details" http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@612 PS6, Line 612: bloom_filters.push_back(bloom_filter); emplace_back http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc File src/kudu/tablet/cfile_set-test.cc: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@21 PS6, Line 21: Nit: no empty line here. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@109 PS6, Line 109: 2nth Nit: "(2n)th" so that it doesn't seem like you're trying to say "second". Below too. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@110 PS6, Line 110: from Nit: form. Below too. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@113 PS6, Line 113: BloomFilterBuilder* bf1_contain, Nit: put this one on its own line for better symmetry with the others. Below too. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@213 PS6, Line 213: const shared_ptr<CFileSet> & Nit: const shared_ptr<CFileSet>& http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@532 PS6, Line 532: // BloomFilter of column 0 contain and column 1 contain Nit: terminate with a period. http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h File src/kudu/util/bloom_filter.h: http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@57 PS6, Line 57: explicit BloomKeyProbe(const Slice &key) : key_(key) { Can you rewrite this constructor to chain to the new one? http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@68 PS6, Line 68: explicit BloomKeyProbe(const Slice &key, HashAlgorithmInBloomFilter hash_algorithm) : The explicit keyword is only for constructors with one argument. https://google.github.io/styleguide/cppguide.html#Implicit_Conversions http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@69 PS6, Line 69: key_(key) { Reformat like this: explicit BloomKeyProbe(arg1 a, arg2 b) : key_(key) { ... } http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@76 PS6, Line 76: 0); Can you reformat: /*seed=*/ 0); -- To view, visit http://gerrit.cloudera.org:8080/11100 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62c2de42667d0255d94e19db773240f7f9ee636c Gerrit-Change-Number: 11100 Gerrit-PatchSet: 6 Gerrit-Owner: ZhangYao <triplesheep0...@gmail.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Dan Burkert <danburk...@apache.org> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: ZhangYao <triplesheep0...@gmail.com> Gerrit-Comment-Date: Wed, 19 Sep 2018 17:39:09 +0000 Gerrit-HasComments: Yes