Marcel Kornacker has posted comments on this change. Change subject: IMPALA-4624: Implement Parquet dictionary filtering ......................................................................
Patch Set 8: (29 comments) http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 532: RETURN_IF_ERROR(InitDictionaries(column_readers_)); the control flow is hard to follow because it the top-level functions don't reflect it. why not divide the column readers into two groups (w/ dict filter and w/o) and have two explicit loops: a) loop over first group: initialize dictionary, check whether it's a filtering candidate, then evaluate filter; b) loop over second group: initialize dictionary. that way, you can also get rid of dict_initialized_. Line 607: // values are not validated, so skip these datatypes for now. also leave todo, this can be done during dictionary construction. Line 615: Status HdfsParquetScanner::EvalDictionaryFilters(int row_group_idx, indentation pass in RowGroup& directly, you don't use the index for anything other than indexing into row_groups Line 628: for (ParquetColumnReader* col_reader : dict_filter_column_readers_) { there is a lot of code here that doesn't evaluate a dictionary. restructure it to make the division of labor and control flow explicit. Line 637: vector<ExprContext*>& slot_conjunct_ctxs = slot_conjunct_it->second; move to where variables are used. Line 701: // here or in ReadDataPage but never both. is that still true? i thought you expect the first page to be dictionary-encoded? also, where is the dict page offset from the metadata being utilized? Line 710: // TODO: handle the 40000 in a more elegant manner (make it a named constant) resolve todo (don't leave todos that take very little time to resolve). Line 723: column_passes_filter = true; 'passes filter' is ambiguous (does this mean the predicate is true or false?). has_match is not. http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 613: void InitDictionaryFilters(); not intuitively named, i thought this had something to do with dictionary construction (since the dictionary is in effect the filter). rename. http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 244: unordered_set<Encoding::type> column_encodings_; missing comments Line 525: dict_encoding_stats_[dict_header.encoding]++; ++... Line 611: column_encodings_.insert(header.data_page_header.definition_level_encoding); is this useful/required? the def/rep level encodings don't vary. Line 1042: for (Encoding::type encoding : columns_[i]->column_encodings_) { you use columns_[i] a lot, create a variable http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: PS8, Line 462: NeedsConversionInline can this be another template parameter? Line 764: // end of the stream. superfluous comment, the function comment in the header file already makes that clear Line 804: new_buffer_size, &buffer, &new_buffer_size, &status, /* peek */ true); indentation Line 811: if (buffer_size == new_buffer_size) { what does this mean? Line 991: return Status("Column chunk should not contain two dictionary pages."); make this an actionable error message (ie, include the file name). http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 431: /// an indication of whether a dictionary is present. so this is true if no dictionary exists? http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/util/dict-test.cc File be/src/util/dict-test.cc: Line 55: for (T i: values) { do a similar loop for GetValue() http://gerrit.cloudera.org:8080/#/c/5904/8/common/thrift/parquet.thrift File common/thrift/parquet.thrift: Line 1: /** is this the unmodified parquet.thrift? (and it comes without a format version indication?) http://gerrit.cloudera.org:8080/#/c/5904/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 190: avoid double-spaced code Line 332: * that are bound by only that slot. The single-slot restriction allows explicitly mention any member variables that get populated (= spell out side effects). Line 349: boolean isDeterministic = true; instead, add a new predicate to Expr.java (IS_DETERMINISTIC_FN_PREDICATE), this looks like it would be generally useful. Line 359: // This is important for dictionary filtering. Dictionaries do not so these predicates aren't just 'single-slot conjuncts', because they're specifically targeted to the requirements of dictionary filtering. rename. Line 692: if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) { precede w/ blank line Line 698: Collections.sort(totalIdxList); why sort these? Line 700: for (Integer idx : totalIdxList) { single line http://gerrit.cloudera.org:8080/#/c/5904/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test: Line 23: single slot predicates: int_col > 1 might as well call this something like 'dictionary predicates', to make explicit that it is evaluated against a dictionary (plus, 'slot' is internal jargon), and also because we don't have plans to use it for anything else (since min/max predicates have different restrictions). -- To view, visit http://gerrit.cloudera.org:8080/5904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-HasComments: Yes