Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 7: (31 comments) Did a first pass over the main {h,cc} files and left some comments http://gerrit.cloudera.org:8080/#/c/12065/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12065/7//COMMIT_MSG@31 PS7, Line 31: added nit: double word http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@694 PS7, Line 694: for (auto& scalar_reader : scalar_readers_) { nit: single line? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@706 PS7, Line 706: vector<pair<int64_t, int64_t>> skip_ranges; I think a typedef or even a struct might make these easier to read, and you could use a single tuple instead of begin_row and end_row below. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@791 PS7, Line 791: if (col_chunk.offset_index_length != 0) { Can it be <0? Otherwise be explicit about >0 or this could have weird consequences for a broken file (e.g. in our fuzz testing). http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@803 PS7, Line 803: CalculateCandidateRanges Should this method maybe also do the sorting? Then it's contract becomes slightly simpler (i.e., input can be unsorted). http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@813 PS7, Line 813: d Can you use a setter for this, or (ideally, if possible) compute those before constructing the page readers and pass them into the c'tor? I looked through the column readers to see where they are initialized before resorting to grep to come here. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@1150 PS7, Line 1150: tlr nit: spell out? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@380 PS7, Line 380: /// we will not even see the bytes of the filtered out pages. I had to look for it, maybe mention where this is initialized? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@463 PS7, Line 463: /// invokes SkipValues(). Does the different between primitive values and rows come from nested types? If so, maybe mention in a small comment? Please also add a comment what this returns? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@466 PS7, Line 466: /// Skip values in the page data. Add a comment what this returns. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@467 PS7, Line 467: SkipValues This is limited to the page, right? If so, maybe rename to SkipValuesInPage()? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@470 PS7, Line 470: index of the first row This reads like it gets the something like "the first row is at index 10 in the page" where I think it should return "this page starts with row 23". Can you clarify the comment? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@491 PS7, Line 491: int FillPositionsInFilteredRange(int rows_remaining, int max_values, Could we move this implementation out of the header? I think generally only one-liners (full definition) should be here. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@518 PS7, Line 518: /// Advance to the next filtered range that contains 'current_row_'. Are "filtered range", "row range", and "filtered row range" below all the same thing" http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@529 PS7, Line 529: auto current_range = parent_->candidate_ranges_[current_row_range_]; > DCHECK(!candidate_data_pages_.empty()) since this doesn't make sense unless I think we should also DCHECK for out-of-range here and above. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@391 PS7, Line 391: bool ScalarColumnReader<InternalType, PARQUET_TYPE, MATERIALIZED>::SkipValues( This doesn't work for plain encoded strings because they don't have an encoded len, no? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@517 PS7, Line 517: if (NeedToSkipRowsInCurrentPage()) { : auto current_range = parent_->candidate_ranges_[current_row_range_]; : int64_t skip_rows = current_range.first - current_row_ - 1; : DCHECK_GE(skip_rows, 0); : if (!SkipTopLevelRows(skip_rows)) return false; : } else if (candidate_page_idx_ == candidate_data_pages_.size() - 1) { : *num_values = val_count; : num_buffered_values_ = 0; : return val_count > 0; : } This part mostly uses global state, so it could go into its own method. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@555 PS7, Line 555: !candidate_data_pages_.empty() I wonder if the code was more clear if we replace this with a more descriptive function, e.g. DoesPageFiltering() or so, especially in cases where we check candidate_data_pages_ without then actually accessing them below the check? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1097 PS7, Line 1097: vector<ScanRange::SubRange> sub_ranges; Should sub_ranges be built in a separate method? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1514 PS7, Line 1514: nit: remove some of the empty lines http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1653 PS7, Line 1653: num_rows == 0 && rep_levels_.PeekLevel() == 0 Could this be the while condition? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@55 PS7, Line 55: calculates I think I'd prefer "compute" over "calculate" for things that are more than a single expression, here and elsewhere. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@55 PS7, Line 55: This function calculates the union of 'skip_ranges' and subtracts it from : /// [0, num_rows). I feel that sentence could go to the implementation. It's not really relevant to the caller how this method accomplishes what it does, but it helps understand the implementation http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@58 PS7, Line 58: std::vector<std::pair<int64_t, int64_t>> nit: by convention we usually return non-podts through pointers. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@61 PS7, Line 61: the page indexes Does this mean "page number" as in "we need to read the 5th and 8th pages"? Then we could say "computes the pages that intersect..." and explain how these pages are returned. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@64 PS7, Line 64: std::vector<int> nit: return through pointer http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@95 PS7, Line 95: skip_end Maybe add a comment to explain what this tracks? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@96 PS7, Line 96: for (auto& skip_range : skip_ranges) { > It might be worth validating the sortedness precondition here by tracking t I think we could also move the sorting here, or add a DCHECK using std::is_sorted. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@100 PS7, Line 100: - 1 Are these inclusive? Generally I feel that the terms [begin,end) signal half-open intervals, where [first, last] signal closed. I'm good with keeping the names but it would probably be good to at least have comments that mention how to interpret the ends of the intervals. If you want to switch to first/last we should have a struct to avoid the confusion with pair<>::first. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@118 PS7, Line 118: int64_t higher_bottom = std::max(lhs.first, rhs.first); bottom and top also could be replaced by first / last, but only if you get rid of the pair. Given the subtleties around half-open vs closed, a struct RowInterval with a comment somewhere might be good. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@123 PS7, Line 123: vector<int> filtered_pages; > filtered_pages is ambiguous. Maybe surviving_pages? remaining_pages? candid I'm in favor of surviving or remaining so it's clear that they're not candidates for further pruning. -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Fri, 22 Mar 2019 23:44:42 +0000 Gerrit-HasComments: Yes