Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11100 )

Change subject: Implement BloomFilter Predicate in server side.
......................................................................


Patch Set 6:

(68 comments)

Looks good, most of my comments are stylistic.

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc
File src/kudu/common/column_predicate-test.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@45
PS6, Line 45: static const int kRandomSeed = 0xdeadbeef;
Can you use SeedRandom() instead? That way we'll use a different seed for every 
run of the test and get additional test coverage while retaining the ability to 
rerun a test with a particular seed using --test_random_seed.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@77
PS6, Line 77:                                        BloomFilterBuilder *bfb1,
            :                                        BloomFilterBuilder *bfb2) {
Nit: BloomFilterBuilder* bfb1


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@79
PS6, Line 79:     srandom(random_seed);
Don't need if you use SeedRandom().


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@83
PS6, Line 83:         uint64_t key = random();
Use util/Random instead.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@88
PS6, Line 88: const uint8_t *
const uint8_t*

(the theme here is that when using a pointer type, the asterisk should be 
adjacent to the type rather than separated by a space).


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@92
PS6, Line 92:         values->push_back(key);
emplace_back in new code. Elsewhere too.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@791
PS6, Line 791:                                         
std::vector<ColumnPredicate::BloomFilterInner>* bf,
Don't need std:: prefix. Below too.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@793
PS6, Line 793:     std::vector<ColumnPredicate::BloomFilterInner> hold = *bf;
Nit: maybe name "orig_bloom_filters" to better convey its purpose?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@828
PS6, Line 828:     std::vector<ColumnPredicate::BloomFilterInner> bf_bak = *bf;
bf_copy maybe?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1382
PS6, Line 1382:   int n_keys = 5; // 0 1both hit bf1 and bf2, 2 3 4 only hit 
bf2.
0 1 both hit


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1414
PS6, Line 1414:   // 0 1both hit bf1 and bf2, 2 3 4 only hit bf2.
0 1 both hit


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate-test.cc@1429
PS6, Line 1429:   // Test for BINARY type
Each of these "Test for ... type" is independent of the other tests, right? 
Wouldn't it be cleaner to declare them each as a separate TEST_F?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h
File src/kudu/common/column_predicate.h:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@275
PS6, Line 275:
             :     const Slice& bloom_data() const {
             :       return bloom_data_;
             :     }
             :
             :     size_t nhash() const {
             :       return nhash_;
             :     }
             :
             :     HashAlgorithmInBloomFilter hash_algorithm() const {
             :       return hash_algorithm_;
             :     }
             :
             :     void set_nhash(size_t nhash) {
             :       nhash_ = nhash;
             :     }
             :
             :     void set_bloom_data(Slice bloom_data) {
             :       bloom_data_ = bloom_data;
             :     }
             :
             :     void set_hash_algorithm(HashAlgorithmInBloomFilter 
hash_algorithm) {
             :       hash_algorithm_ = hash_algorithm;
             :     }
I don't see the point of these getters and setters since the members are all 
public to begin with. Why not just access the members directly?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@301
PS6, Line 301:       return (bloom_data_ == other.bloom_data_ && nhash_ == 
other.nhash_
             :               && hash_algorithm_ == other.hash_algorithm_);
Nit: reformat like:

  return (bloom_data_ == other.bloom_data_ &&
          nhash_ == other.nhash_ &&
          hash_algorithm_ == other.hash_algorithm_);

That way it's less noisy to add new members.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@304
PS6, Line 304:     // The slice of bloom filter.
Nit: separate each member from the member before it (or from operator==) with 
an empty line.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@308
PS6, Line 308: , currently support
             :     // murmur2 and cityhash.
I don't think this comment is necessary; the list of supported algorithms is 
embodied by the set of enum values.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@354
PS6, Line 354: const ColumnPredicate &
Nit: const ColumnPredicate&


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.h@425
PS6, Line 425:   // The vector of bloom filter in this predicate.
Nit: "The list of bloom filters in this predicate."


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc
File src/kudu/common/column_predicate.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@68
PS6, Line 68:         : predicate_type_(predicate_type),
            :           column_(move(column)),
            :           lower_(lower),
            :           upper_(upper) {
Nit: use the same indentation as in the other two constructors.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@113
PS6, Line 113:   CHECK(!bfs->empty());
Should check that bfs != nullptr.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@189
PS6, Line 189:   bloom_filters_.clear();
Hmm, why don't we clear values_ in this function? Why is it necessary to clear 
bloom_filters_ but not values_?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@226
PS6, Line 226:           predicate_type_ = PredicateType::None;
             :           lower_ = nullptr;
             :           upper_ = nullptr;
Could use SetToNone here.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@254
PS6, Line 254:     case PredicateType::InBloomFilter: {
If we allowed an empty list of bloom filters, could that simplify into 
something else?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@264
PS6, Line 264:           if (CheckValueInBloomFilter(lower_)) {
             :             predicate_type_ = PredicateType::Equality;
             :             upper_ = nullptr;
             :             bloom_filters_.clear();
             :           } else {
             :             SetToNone();
             :           }
Can you explain the rationale behind this simplification? It's not obvious to 
me.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@275
PS6, Line 275:           upper_ = nullptr;
Isn't upper_ guaranteed to be nullptr in this branch?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@277
PS6, Line 277:           if (CheckValueInBloomFilter(lower_)) {
             :             predicate_type_ = PredicateType::Equality;
             :             upper_ = nullptr;
             :             bloom_filters_.clear();
             :           } else {
             :             SetToNone();
             :           }
Also not understanding this simplification; could you explain in a comment?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@287
PS6, Line 287:           SetToNone();
Why is this case different from the L274-L275 case? That is, why do we use 
SetToNone() here but not there?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@443
PS6, Line 443:       if (!other.CheckValueInBloomFilter(lower_)) {
             :         SetToNone();
             :       }
Can you explain this?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@616
PS6, Line 616:       // Merge the optional lower and upper bound.
             :       if (other.lower_ != nullptr &&
             :           (lower_ == nullptr || 
column_.type_info()->Compare(lower_, other.lower_) < 0)) {
             :         lower_ = other.lower_;
             :       }
             :       if (other.upper_ != nullptr &&
             :           (upper_ == nullptr || 
column_.type_info()->Compare(upper_, other.upper_) > 0)) {
             :         upper_ = other.upper_;
             :       }
This pattern seems to repeat itself quite a bit in this class. Could you 
decompose it into one or two helper functions in an anonymous namespace?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@630
PS6, Line 630:         // value falls in list so change to Equality predicate
Nit: "bloom filter", not "list"


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@636
PS6, Line 636:  in list
in bloom filter


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@647
PS6, Line 647:       std::vector<const void *> new_values;
Nit: vector<const void*> new_values;

(no need for std:: prefix either).


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@650
PS6, Line 650:           new_values.push_back(value);
Nit: emplace_back in new code.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/column_predicate.cc@870
PS6, Line 870:       return strings::Substitute("`$0` IS BloomFilter", 
column_.name());
IS IN BloomFilter, perhaps?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@357
PS6, Line 357:   // The bloomfilter for quick filter the row.
What does "quick filter the row" mean? Could you elaborate?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@362
PS6, Line 362:     // And We currently use CityHash backend.
Doesn't this depend on the value of hash_algorithm?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@365
PS6, Line 365:     optional HashAlgorithmInBloomFilter hash_algorithm = 3;
Does this need a default value specifier for CITY_HASH?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@377
PS6, Line 377: We will eventually add a special
             :     // predicate type for null-ness.
Nit: not your fault, but could you remove this sentence? We've since added 
IsNotNull and IsNull predicates.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@404
PS6, Line 404:     repeated BloomFilter bloom_filters = 1;
Nit: add a comment explaining what this is.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@406
PS6, Line 406:     // Lower and Upper is optional for InBloomFilter.
How do lower and upper work with bloom filters? Can you doc in a comment?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/common.proto@408
PS6, Line 408:     optional bytes lower = 3 [(kudu.REDACT) = true];
Use field id 2 here and 3 for 'upper'; no reason to skip 2 since there wasn't a 
field 2 to begin with.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/key_util.cc
File src/kudu/common/key_util.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/key_util.cc@234
PS6, Line 234:       case PredicateType::InBloomFilter:  // Upper in 
InBloomFilter process as upper in Range.
Nit: "upper in InBloomFilter processed as upper in Range"

Below too (with 'lower' instead of 'upper')


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc
File src/kudu/common/wire_protocol-test.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@44
PS6, Line 44: #include "kudu/util/bloom_filter.h"
Not sorted correctly.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@508
PS6, Line 508:   Arena arena(1024);
WireProtocolTest already has an arena, could reuse it?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@536
PS6, Line 536:     ASSERT_NO_FATAL_FAILURE(ColumnPredicateToPB(ibf, &pb));
NO_FATALS in new code. Below too.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol-test.cc@555
PS6, Line 555: TEST_F(WireProtocolTest, 
TestColumnPredicateBloomFilterWithBound) {
The setup code is identical in both tests; can you decompose it into a new test 
fixture? Something like:

  class BFWireProtocolTest : public WireProtocolTest {
    <members or functions for these two tests>
  }

  TEST_F(BFWireProtocolTest, Blah) {
    ...
  }


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc
File src/kudu/common/wire_protocol.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@416
PS6, Line 416:   const void* src;
             :   size_t size;
             :   src = bf_src.bloom_data().data();
             :   size = bf_src.bloom_data().size();
Combine these four lines into two (i.e. combine declaration and assignment).


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@455
PS6, Line 455: // Extract BloomFilterInner for bloom data for 
ColumnBloomFilterPredicate.
from


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@461
PS6, Line 461:   uint8_t* data_copy = 
static_cast<uint8_t*>(arena->AllocateBytes(bf_src.bloom_data().size()));
bf_src.bloom_data().size() is repeated three times; can you store in a local 
variable with a shorter name?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@463
PS6, Line 463:   const Slice* slice = arena->NewObject<Slice>(data_copy, 
bf_src.bloom_data().size());
Why do we need to allocate the slice itself from the arena? Seems like we turn 
around and immediately make a copy of it on L464, so couldn't we just declare a 
local slice for 'data_copy'?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@511
PS6, Line 511:
Nit: unnecessary space here.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@514
PS6, Line 514: ColumnPredicatePB::BloomFilter *bloom_filter
Nit: ColumnPredicatePB::BloomFilter* bloom_filter


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@595
PS6, Line 595:         *predicate = ColumnPredicate::IsNull(col);
             :         break;
Nit: while you're here, could you fix this indentation?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@601
PS6, Line 601: < 1
Nit: == 0 is more clear


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@603
PS6, Line 603: on bloom filter contained
Nit: no bloom filters


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@607
PS6, Line 607:  bloom filter
Nit: "in bloom filter"


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@608
PS6, Line 608: no nhash or bloom filter or hash algorithm
Nit: "missing bloom filter details"


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/common/wire_protocol.cc@612
PS6, Line 612:         bloom_filters.push_back(bloom_filter);
emplace_back


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc
File src/kudu/tablet/cfile_set-test.cc:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@21
PS6, Line 21:
Nit: no empty line here.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@109
PS6, Line 109: 2nth
Nit: "(2n)th" so that it doesn't seem like you're trying to say "second". Below 
too.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@110
PS6, Line 110: from
Nit: form. Below too.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@113
PS6, Line 113: BloomFilterBuilder* bf1_contain,
Nit: put this one on its own line for better symmetry with the others. Below 
too.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@213
PS6, Line 213: const shared_ptr<CFileSet> &
Nit: const shared_ptr<CFileSet>&


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/tablet/cfile_set-test.cc@532
PS6, Line 532:   // BloomFilter of column 0 contain and column 1 contain
Nit: terminate with a period.


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h
File src/kudu/util/bloom_filter.h:

http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@57
PS6, Line 57:   explicit BloomKeyProbe(const Slice &key) : key_(key) {
Can you rewrite this constructor to chain to the new one?


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@68
PS6, Line 68:   explicit BloomKeyProbe(const Slice &key, 
HashAlgorithmInBloomFilter hash_algorithm) :
The explicit keyword is only for constructors with one argument.

https://google.github.io/styleguide/cppguide.html#Implicit_Conversions


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@69
PS6, Line 69:                          key_(key) {
Reformat like this:

  explicit BloomKeyProbe(arg1 a, arg2 b)
      : key_(key) {
    ...
  }


http://gerrit.cloudera.org:8080/#/c/11100/6/src/kudu/util/bloom_filter.h@76
PS6, Line 76:                 0);
Can you reformat:

  /*seed=*/ 0);



--
To view, visit http://gerrit.cloudera.org:8080/11100
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62c2de42667d0255d94e19db773240f7f9ee636c
Gerrit-Change-Number: 11100
Gerrit-PatchSet: 6
Gerrit-Owner: ZhangYao <triplesheep0...@gmail.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <danburk...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: ZhangYao <triplesheep0...@gmail.com>
Gerrit-Comment-Date: Wed, 19 Sep 2018 17:39:09 +0000
Gerrit-HasComments: Yes

Reply via email to