Todd Lipcon has posted comments on this change.

Change subject: Predicate evaluation pushdown
......................................................................


Patch Set 4:

(39 comments)

did a first pass of comments. i think splitting out the optimization on 
predicate evaluation to avoid the extra indirections is a worthy thing to do 
separately from the pushdown stuff

http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/cfile/block_encodings.h
File src/kudu/cfile/block_encodings.h:

Line 138:   virtual Status SeekForward(size_t* n) = 0;
I think it'd be better to use 'int' here, so you can DCHECK_GE(*n, 0) at the 
start. We used to use size_t more (this is old code) but int's safer because an 
underflow bug will be more obvious


Line 157:   // function is that ctx->eval_complete() is true, and false 
otherwise.
I think there's a subtle requirement here that, given a particular encoding 
type and a particular predicate, this must completely determine whether 
eval_complete will get set. In other words, a decoder can't be data-dependent 
on whether to set eval_complete, right?

Otherwise we could have something like:

cfile block 1: decoder says "not complete"
cfile block 2: decode says "complete"
top-level iterator code says "aha! it's complete" and decides that it doesn't 
need to evaluate predicates.

Making 'eval_complete' a tri-state (EvaluationState) with an enum like 
{NOT_STARTED, EVALUATED, NOT_EVALUATED} and ensuring that you can only move 
from NOT_STARTED to one of the others, but not between the other two, would be 
a good way to add some safety here


Line 162:     CopyNextValues(n, dst);
RETURN_NOT_OK


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/cfile/cfile_reader.cc
File src/kudu/cfile/cfile_reader.cc:

Line 670:   // If seeking within the dasta block results in NotFound, then that 
indicates that the
typo


Line 728:     // No ctx_->sel() defined implies there is no predicate 
evaluation at the
I think a getter like 'ctx_->try_decoder_predicate_evaluation()' would be a bit 
nicer here


Line 733:     // this happens are kept for compatibility with older test cases.
would rather add a wrapper in the test code instead of adding a compatibility 
path here in the real code


Line 747:       LOG(INFO) << BitmapToString(pred_set_->bitmap(), 
pred_set_->nrows());
should remove this


Line 977:           // These will be null if gflag for decoder eval is off
again a getter would centralize these checks and make the code more readable 
here


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/cfile/cfile_reader.h
File src/kudu/cfile/cfile_reader.h:

Line 24: #include "kudu/common/column_eval_context.h"
forward decl is sufficient


Line 222:   virtual Status PrepareScan(rowid_t ord_idx, ColumnEvalContext *ctx) 
= 0;
would be nice to avoid this "duplicate API" thing where one set of things gets 
called when the feature is on, and the other gets called otherwise. It ends up 
being multiplicative in the number of code paths that you have to consider when 
thinking about the API, which is a lot of cognitive overhead. Can we change 
things such that we _always_ have a ColumnEvalContext, but just set some flag 
in it when we don't want to attempt the pushdown?


Line 261:   virtual Status Scan(ColumnEvalContext *ctx) = 0;
see above, would be good to collapse the two APIs


Line 394:   // single set of satisfying predicates.
this isn't returning a set of predicates, though -- it's returning a set of 
codeword indexes


Line 395:   SelectionVector* GetPredicateSet() { return pred_set_.get();}
should rename to something more specifically about dictionary... maybe 
GetCodeWordsMatchingPredicate or something?


Line 468:   std::unique_ptr<SelectionVector>pred_set_;
rename (per above)


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/common/column_eval_context.h
File src/kudu/common/column_eval_context.h:

Line 1: #pragma once
nit: license header


Line 6: // A ColumnEvalContext provides a clean interface to the set of objects 
that get passed down to
doc belongs above the class, not above the namespace


Line 7: // the decoders during predicate pushdown
nit: add period at end of sentence


Line 12:   ColumnEvalContext(const size_t col_idx,
const primitive doesn't make much sense


Line 16:                     bool* eval_complete) :
please document the lifetime of these pointer arguments (i.e they must outlive 
the context object)

also, I wonder whether some of these items could actually just be made members 
of ColumnEvalContext itself instead of passed in as pointers? (in particular 
eval_complete)


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/common/column_predicate.cc
File src/kudu/common/column_predicate.cc:

Line 361: void ColumnPredicate::Evaluate(const ColumnBlock& block, 
SelectionVector* sel) const {
it seems like this optimization is entirely separate from the "push down 
predicate evaluation into the block decoders" work, right? can we split it into 
a separate patch to properly quantify the improvement?


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/common/column_predicate.h
File src/kudu/common/column_predicate.h:

Line 128:   // This should only be called when a query has ranges specified
this comment can be more general -- i.e "when the predicate is based on a 
cell's value." (also add period)


Line 129:   // i.e. None predicates and IsNotNull predicates should not use this
nit: "should not" is vague. "Do not"? What happens if they do? It seems funny 
that here you say it shouldn't be used, but then the .cc file has some code 
which seems to implement those cases


Line 130:   bool EvaluateCell(DataType Type, const void *cell) const;
should clarify that this is a physical_type (same below)


PS4, Line 132: e <D
nit: no space after 'template'. (same below)


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/common/generic_iterators.cc
File src/kudu/common/generic_iterators.cc:

Line 505:   // TODO: add a condition: If the block has support for predicate 
pushdown, call predicate pushdown
is this still relevant?


Line 508:   // Based on FLAGS_materializing_iterator_decoder_eval, will do 
decoder eval
don't think this is necessary


Line 523:     bool eval_complete = false;
making eval_complete a member of ColumnEvalContext seems easier than passing in 
the pointer to the stack-var


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/common/iterator.h
File src/kudu/common/iterator.h:

Line 24: #include "kudu/common/column_predicate.h"
I think you can get away with forward decls here


Line 102:   virtual Status MaterializeColumn(size_t col_idx, ColumnBlock *dst) 
= 0;
is this still needed? It seems like the code is now calling the other overload


Line 103:   virtual Status MaterializeColumn(ColumnEvalContext *ctx) = 0;
need doc.


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/common/rowblock.h
File src/kudu/common/rowblock.h:

Line 121: class SelectionVectorView {
doc


PS4, Line 137: _+r
nit: space around ' + ' (here and below)


Line 141:     if (sel_vec_) {
all these if()s seem expensive as these are hot-path functions. Maybe we can 
cross our fingers that the compiler will hoist the checks, but I'm skeptical. 
Where would we use a null sel-vec?


Line 168:   SelectionVector *sel_vec_;
SelectionVector * const sel_vec_;
(so you can't re-assign a different pointer later)


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/common/types.h
File src/kudu/common/types.h:

Line 82:   const std::function<int(const void*, const void*)> 
comparator_getter_;
hrm, this is actually the comparator itself, not a getter which then returns a 
comparator


PS4, Line 378:  return [](const void* lhs, const void* rhs) {
             :       const Slice *lhs_slice = reinterpret_cast<const Slice 
*>(lhs);
             :       const Slice *rhs_slice = reinterpret_cast<const Slice 
*>(rhs);
             :       return lhs_slice->compare(*rhs_slice);
do we know whether returning a lambda with no captures is equivalent to 
returning a function pointer (which will get wrapped in std::function<>?)


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/tablet/delta_store.h
File src/kudu/tablet/delta_store.h:

Line 176:   // Returns true if there there exists deltas to be applied
typo: "there there"
- should clarify whether this means only in the current batch, or in the 
iterator at all. should clarify whether it is conservative or accurate (i.e. is 
it always safe to return true?)

suggestion: rename this to "MayHaveDeltas" since it actually returns true in 
the case of any type of delta, not just UPDATE, and also the 'may' part implies 
that it's allowed to return true conservatively (even if in fact there are none)


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/tablet/deltafile.cc
File src/kudu/tablet/deltafile.cc:

Line 821:   return HasNext();
I think this could be better done by:
- require that it be done in the context of a single batch (i.e 
DCHECK(prepared_) and DCHECK_)
- should have the semantics that it returns true if there are any updates to be 
applied in the current batch. (in other words, return true unless 
ApplyUpdates() would be a no-op)
- can then be implemented based on the delta_blocks_ member -- can check if 
it's empty, or if the first block is outside of the current prepared batch

Another potential TODO here is to pass in the column index here (makes it more 
parallel with the ApplyUpdates API), and then in the future we can be smart 
enough to see the case where a predicate is on a non-updated column even if 
there are other columns that have updates)


http://gerrit.cloudera.org:8080/#/c/3990/4/src/kudu/tablet/tablet-decoder-eval-test.cc
File src/kudu/tablet/tablet-decoder-eval-test.cc:

Line 1: #include <glog/logging.h>
nit: license header


-- 
To view, visit http://gerrit.cloudera.org:8080/3990
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I31e4cce21e99f63b089d7c84410af8ed914cb576
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <andrew.w...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <d...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes

Reply via email to