[kudu-CR] KUDU-1165: Implement partition pruning for C++ client and server

Adar Dembo (Code Review) Tue, 22 Mar 2016 19:32:42 -0700

Adar Dembo has posted comments on this change.

Change subject: KUDU-1165: Implement partition pruning for C++ client and server
......................................................................



Patch Set 20:

(28 comments)

http://gerrit.cloudera.org:8080/#/c/2413/20/src/kudu/common/partition_pruner-test.cc
File src/kudu/common/partition_pruner-test.cc:

Line 34: #include "kudu/util/auto_release_pool.h"
I don't see this used. Could you double check the other includes too?


http://gerrit.cloudera.org:8080/#/c/2413/20/src/kudu/common/partition_pruner.cc
File src/kudu/common/partition_pruner.cc:

Line 77:     // Don't bother if there are no lower/upper PK bounds
The code "doesn't bother" if both the lower and upper bounds are missing. But 
the comment is a little ambiguous: should we also not bother if only one of the 
two is missing, but the other is present?

I'm guessing the code is fine but the comment could be a touch more explicit, 
like "Don't bother if both the lower and upper PK bounds are missing."


Line 135:   Arena arena(max<size_t>(Arena::kMinimumChunkSize, 
schema.key_byte_size()), 4096);
Would be good to add a comment explaining the choices in this constructor.


Line 165:   // DISTRIBUTE BY RANGE (c)
As we discussed, DISTRIBUTE BY RANGE without any split rows has no effect, so 
it might be worth omitting from this example.

Alternatively, you could describe it in the design doc and mention how it 
dovetails in with manual tablet splits to provide an actual use case.

But as written, I had no idea what it meant when I came across it.


Line 173:   // | a = 0      | [(bucket=0, bucket=2, c=0), (bucket=0, bucket=2, 
c=1)) |
Nit: where is the end bracket for each of these examples?


Line 182:   // |            | [(bucket=0, bucket=2, c=0), (bucket=1, bucket=2, 
c=1)) |
I don't understand this example.

The equality predicates mean that, between our six tablets, we should scan a 
small amount of data (c=0) on three of them. The first partition key range 
agrees with that, but the second and third specify an end-of-range at 
a_bucket=1, not a_bucket=0 (what I expected to see). Why is that?


Line 202:   // components and a range component, then then a few patterns 
emerge from the
Nit: double then


Line 203:   // examples above:
Nit: terminology-wise, perhaps it'd be clearer if you used the word "column" 
instead of "component"? You do use both in the design doc, but I think you use 
them to mean the same thing, so it'd be nice to stick to one throughout the 
comments and the doc.


Line 213:   //    it an exclusive key.
This seems to be somewhat true if the final constrained component is in a range 
too; in your examples above, whenever c=0 the resulting range began at c=0 and 
ended at c=1.


Line 217:   //    before a final constrained component.
You should also mention the "base case": if there are no unconstrained hash 
components, there is only one partition key range.


Line 221:   std::string range_upper_bound;
Nit: don't need std:: prefix.


Line 224:     if (AreRangeColumnsPrefixOfPrimaryKey(schema, range_columns)) {
For sanity checking, should we first verify that schema and partition_schema 
agree somewhat? For example, AreRangecolumnsPrefixOfPrimaryKey() will iterate 
off the end of schema if range_columns is somehow longer.


Line 271:   // The index of the final constrained hash component in the 
partition key.
Okay, it's roughly from here (well, from the std::distance call) that my eyes 
began glazing over. To help readers, it may be useful to include in each "step 
x: ..." comment a block that explains what the state of the world is at that 
point, or at least what state this step mutates and how. For example:

  // Step 1: Foo the bar.
  //
  // At this point, baz1, baz2, and baz3 are all empty.

  ...

  // Step 2: Blargle the argle.
  //
  // When we finish this step, baz1 will still be empty,
  // but baz2 will be {1, 2, 4}, because of the blargling.

  ...

  // Step 3: Frobby the robby.
  //
  // We're done. baz2 is still {1, 2, 4}, and the frobbing
  // has changed to be baz1 "doink" and baz3 to "dink",
  // which makes sense due to reasons.

This is a dumb example, but hopefully you see what I'm getting at: it helps 
chew on each piece of the algorithm independently.


Line 283:   // Create the set of ranges.
Nit: stopped using step numbers?


Line 323:     get<0>(range).append(range_lower_bound);
Can range_lower_bound/range_upper_bound be empty here? I guess append() will 
just be a no-op then.


Line 327:   // Remove all ranges past the scan spec's upper bound partition key 
key.
Nit: key key


Line 330:       if (!get<1>(*range).empty() &&
Why do we check for empty() on the end of the range but not on the beginning, 
below? Is it because an empty string will naturally compare the way we want it 
to, during a <= comparison?


Line 367:   for (auto range = partition_key_ranges_.rbegin();
Looks similar to the "remove all ranges past..." logic above, but is just 
different enough.


Line 391:          partition.partition_key_end() <= get<0>(*range));
Nit: indent by one more character to show that it's part of the parens.


http://gerrit.cloudera.org:8080/#/c/2413/20/src/kudu/common/partition_pruner.h
File src/kudu/common/partition_pruner.h:

Line 18: #pragma once
So this is what we're doing going forward, instead of the older style of 
include guards?


Line 38: class PartitionPruner {
May be useful to add in the class comment that partition keys are represented 
as type std::string. Any interesting encoding the reader should be aware of? Or 
should it be treated as an opaque array of bytes?


Line 41:    PartitionPruner() = default;
Nit: indentation is a little off.


Line 44:   // already be optimized.
It isn't clear what "optimized" means in this context.


Line 50:   bool HasMorePartitions() const;
Nit: as a general comment, this domain is fraught with terminology (as you saw 
in my comments on the design spec). Let's minimize that where possible. In this 
particular case, we're referring to things to be scanned as either "partitions" 
or "partition key ranges". Can we stick to one of those terms throughout?


Line 55:   // Removes partition key ranges through the provided exclusive upper 
bound.
Nit: Removes all partition key ranges...


Line 60:   // Used for testing.
Only for testing? If so, add ForTests() to the method name.


Line 63:   std::string ToString(const Schema& schema, const PartitionSchema& 
partition_schema) const;
Normally ToString() is a no-arg method. Given that Init() is provided the 
schema and partition_schema, would it be unreasonable to retain pointers to 
those objects and reuse them here?


http://gerrit.cloudera.org:8080/#/c/2413/20/src/kudu/util/memory/arena.h
File src/kudu/util/memory/arena.h:

Line 270:   static const size_t kMinimumChunkSize;
Add a comment here?

Also, shouldn't this be exposed in ThreadSafeArena too? Maybe it should be a 
field within ArenaBase? Will it be inherited then?


-- 
To view, visit http://gerrit.cloudera.org:8080/2413
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9f4fc14b44af63a17b34cd9c2a9134127b1afe4d
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1165: Implement partition pruning for C++ client and server

Reply via email to