Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12205 )

Change subject: KUDU-2645: tablet: Support deduplication of deleted rows in 
MergeIterator
......................................................................


Patch Set 3:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG@19
PS3, Line 19: Because of this new behavior, while the MergeIterator will 
deduplicate
            : deleted rows if they are included in the result set, the 
UnionIterator
            : will not deduplicate them and instead return all instances found.
I might reword this to state that it's not possible to efficiently support this 
deduplication _without_ a merge-like process, so there's no reason to extend it 
to the UnionIterator.


http://gerrit.cloudera.org:8080/#/c/12205/3//COMMIT_MSG@22
PS3, Line 22:
            : One tangentially-related change in this patch is that TestMerge in
            : generic_iterators-test.cc was modified to no longer generate 
duplicate
            : non-deleted keys for merge testing
This makes sense given what you're trying to do, but it's a bit of a bummer in 
that it makes the MergeIterator less "generic".


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators-test.cc
File src/kudu/common/generic_iterators-test.cc:

PS3:
Could you add a test case for include_deleted_rows=true that uses these simple 
VectorIterators? That'll help assess the impact of deduplication.

Separately, could you assess the impact of the MergeIterator change when 
!include_deleted_rows? See my dominance patch for some ideas on how to 
benchmark.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators-test.cc@232
PS3, Line 232:       seen.insert(potential);
Nit: InsertOrDie. Or EmplaceOrDie (if it works).


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.h
File src/kudu/common/generic_iterators.h:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.h@38
PS3, Line 38: schema
Nit: schemas (since this is a requirement for every sub-iterator).


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@300
PS3, Line 300:     : opts_(opts),
std::move(opts)


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@419
PS3, Line 419:   vector<MergeIterState*> smallest(states_.size());
Can we micro-optimize the !include_deleted_rows case such that the operations 
(and allocation) of the vector are avoided?


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@432
PS3, Line 432:     for (const auto& iter : states_) {
There's a non-obvious state machine in here; could you document it?


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@439
PS3, Line 439:         smallest = { iter.get() };
Is this is a copy assignment? For the common case where there are no 
duplicates, is this as efficient as overwriting the first element?


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@441
PS3, Line 441:         smallest.push_back(iter.get());
Nit: emplace_back for new code.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@465
PS3, Line 465:           // We found the single live instance of the row.
In DEBUG builds, should we verify that any remaining instances are ghosts?


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@479
PS3, Line 479:       RETURN_NOT_OK(CopyRow(row_to_return->next_row(), &dst_row, 
dst->arena()));
I think this can be factored out of both sides of the if statement if you 
declare row_to_return on L448 and set it appropriately in the 
!include_deleted_rows case.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@487
PS3, Line 487:     // Remove exhausted sub-iterators from the list we merge 
from.
Any reason not to combine this loop with the one that advances the 
sub-iterators just above?


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@504
PS3, Line 504:
             :   // Maintain a record of the number of rows actually copied to 
the destination
             :   // RowBlock.
I think this is mostly self-explanatory from the code. What would be 
interesting to document, though, is that the possibility of deduplication is 
what makes the previous dst->Resize() call in PrepareBatch insufficient.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/common/generic_iterators.cc@536
PS3, Line 536:   return unique_ptr<RowwiseIterator>(new MergeIterator(opts, 
std::move(iters)));
std::move(opts)


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc
File src/kudu/tablet/diff_scan-test.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc@61
PS3, Line 61:   ASSERT_TRUE(order_mode == UNORDERED || order_mode == ORDERED)
            :       << "unknown order mode: " << OrderMode_Name(order_mode);
Doesn't seem necessary to me; the list of values is just a few lines above.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/diff_scan-test.cc@97
PS3, Line 97:   opts.include_deleted_rows = true;
Would be cool to parameterize on this too.


http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/rowset.cc
File src/kudu/tablet/rowset.cc:

http://gerrit.cloudera.org:8080/#/c/12205/3/src/kudu/tablet/rowset.cc@105
PS3, Line 105:     case ORDERED: {
Why do you need the new scope here? Same for tablet.cc.



--
To view, visit http://gerrit.cloudera.org:8080/12205
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00614b3fa5c6b4e7b620bb78489e24c5ad44daee
Gerrit-Change-Number: 12205
Gerrit-PatchSet: 3
Gerrit-Owner: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Comment-Date: Wed, 13 Feb 2019 05:22:37 +0000
Gerrit-HasComments: Yes

Reply via email to