[Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering

Marcel Kornacker (Code Review) Fri, 17 Feb 2017 16:16:29 -0800

Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
......................................................................



Patch Set 8:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 532:     RETURN_IF_ERROR(InitDictionaries(column_readers_));
the control flow is hard to follow because it the top-level functions don't 
reflect it.

why not divide the column readers into two groups (w/ dict filter and w/o) and 
have two explicit loops: a) loop over first group: initialize dictionary, check 
whether it's a filtering candidate, then evaluate filter; b) loop over second 
group: initialize dictionary. that way, you can also get rid of 
dict_initialized_.


Line 607:     // values are not validated, so skip these datatypes for now.
also leave todo, this can be done during dictionary construction.


Line 615:   Status HdfsParquetScanner::EvalDictionaryFilters(int row_group_idx,
indentation

pass in RowGroup& directly, you don't use the index for anything other than 
indexing into row_groups


Line 628:   for (ParquetColumnReader* col_reader : dict_filter_column_readers_) 
{
there is a lot of code here that doesn't evaluate a dictionary. restructure it 
to make the division of labor and control flow explicit.


Line 637:     vector<ExprContext*>& slot_conjunct_ctxs = 
slot_conjunct_it->second;
move to where variables are used.


Line 701:     // here or in ReadDataPage but never both.
is that still true? i thought you expect the first page to be 
dictionary-encoded? also, where is the dict page offset from the metadata being 
utilized?


Line 710:     // TODO: handle the 40000 in a more elegant manner (make it a 
named constant)
resolve todo (don't leave todos that take very little time to resolve).


Line 723:         column_passes_filter = true;
'passes filter' is ambiguous (does this mean the predicate is true or false?). 
has_match is not.


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 613:   void InitDictionaryFilters();
not intuitively named, i thought this had something to do with dictionary 
construction (since the dictionary is in effect the filter). rename.


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 244:   unordered_set<Encoding::type> column_encodings_;
missing comments


Line 525:     dict_encoding_stats_[dict_header.encoding]++;
++...


Line 611:   
column_encodings_.insert(header.data_page_header.definition_level_encoding);
is this useful/required? the def/rep level encodings don't vary.


Line 1042:     for (Encoding::type encoding : columns_[i]->column_encodings_) {
you use columns_[i] a lot, create a variable


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS8, Line 462:  NeedsConversionInline
can this be another template parameter?


Line 764:   // end of the stream.
superfluous comment, the function comment in the header file already makes that 
clear


Line 804:       new_buffer_size, &buffer, &new_buffer_size, &status, /* peek */ 
true);
indentation


Line 811:     if (buffer_size == new_buffer_size) {
what does this mean?


Line 991:         return Status("Column chunk should not contain two dictionary 
pages.");
make this an actionable error message (ie, include the file name).


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 431:   /// an indication of whether a dictionary is present.
so this is true if no dictionary exists?


http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/util/dict-test.cc
File be/src/util/dict-test.cc:

Line 55:   for (T i: values) {
do a similar loop for GetValue()


http://gerrit.cloudera.org:8080/#/c/5904/8/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 1: /**
is this the unmodified parquet.thrift? (and it comes without a format version 
indication?)


http://gerrit.cloudera.org:8080/#/c/5904/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 190: 
avoid double-spaced code


Line 332:    * that are bound by only that slot. The single-slot restriction 
allows
explicitly mention any member variables that get populated (= spell out side 
effects).


Line 349:       boolean isDeterministic = true;
instead, add a new predicate to Expr.java (IS_DETERMINISTIC_FN_PREDICATE), this 
looks like it would be generally useful.


Line 359:       // This is important for dictionary filtering. Dictionaries do 
not
so these predicates aren't just 'single-slot conjuncts', because they're 
specifically targeted to the requirements of dictionary filtering.

rename.


Line 692:     if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) {
precede w/ blank line


Line 698:         Collections.sort(totalIdxList);
why sort these?


Line 700:         for (Integer idx : totalIdxList) {
single line


http://gerrit.cloudera.org:8080/#/c/5904/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 23:      single slot predicates: int_col > 1
might as well call this something like 'dictionary predicates', to make 
explicit that it is evaluated against a dictionary (plus, 'slot' is internal 
jargon), and also because we don't have plans to use it for anything else 
(since min/max predicates have different restrictions).


-- 
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4624: Implement Parquet dictionary filtering

Reply via email to