[Impala-ASF-CR] WIP: IMPALA-5843: Use page index in Parquet files to skip pages

Zoltan Borok-Nagy (Code Review) Thu, 31 Jan 2019 09:17:51 -0800

Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )


Change subject: WIP: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 4:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.h@432
PS1, Line 432: ctionC
> mention that it's the column index within Parquet file?
Done


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.h@446
PS1, Line 446:
> types: pages
Done


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@619
PS1, Line 619:         EvaluateStatsConjuncts(file_metadata_, row_group, 
&skip_row_group_on_stats));
> I guess it depends on whether errors indicate a problem with the index, or
For now I leave it as it is.


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@686
PS1, Line 686: F
> I think this function needs a comment (maybe you were planning anyway since
Removed this lambda and introduced a new class called ParquetPageIndex.


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@718
PS1, Line 718:     
RETURN_IF_ERROR(page_index_.DeserializeColumnIndex(col_chunk, &column_index));
> It would be great to reduce duplication between this loop and the one in Hd
Added a TODO about it. The code structure is similar, but we do different 
things in the nested parts. I will think about it.


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@738
PS1, Line 738:         if (fn_name == "lt" || fn_name == "le") {
> I think doing an I/O for each column will get pretty expensive (as opposed
I created a new class called ParquetPageIndex. It reads the whole page index 
and stores in an internal buffer. Here we only deserialize the parts of the 
page index that we are interested in.


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@745
PS1, Line 745:           DCHECK(false) << "Unsupported function name for 
statistics evaluation: "
> The timestamp related hacking seems to be missing, see https://github.com/a
Done


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@754
PS1, Line 754:         BaseScalarColumnReader* scalar_reader
> Do you have plans for null pages? They should be also skipped for "fn"/"lt"
Thanks, now I skip rows belonging to null pages.


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@1556
PS1, Line 1556:     ParquetColumnReader* reader;
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/12065/2/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@1562
PS2, Line 1562:   if (pos_slot_desc != nullptr) {
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/parquet-column-stats.cc
File be/src/exec/parquet/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/parquet-column-stats.cc@96
PS1, Line 96: YPE_BIGINT
> Is there a reason for changing element_.type to col_chunk_.meta_data.type?
No, it was accidental. Reverted it.


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/parquet-common.cc@98
PS1, Line 98: if (skip_end < skip_range.second) skip_end = skip_range.second;
> Is it possible for this condition to fail if skipped ranges cannot overlap?
At this point 'skip_ranges' can contain overlapping ranges.

Therefore we need this condition in the following case, e.g.:
 * we have two ranges, [0, 10], [2, 5]
 * in the first iteration, we'll set skip_end to 10
 * in the second iteration, at L97, skip_end(=10) + 1 >= skip_range.first(=2)
 * But we don't want to set skip_end to 5, so we need this condition.


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/parquet-level-decoder.h
File be/src/exec/parquet/parquet-level-decoder.h:

http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/exec/parquet/parquet-level-decoder.h@62
PS1, Line 62: move
> nit: move
Done


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/util/bit-stream-utils.h
File be/src/util/bit-stream-utils.h:

PS1:
> Might be worth pulling out the rle and dict decoder changes into a separate
Done


http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/12065/1/be/src/util/dict-encoding.h@549
PS1, Line 549:     int64_t num_to_skip = std::min<int64_t>(num_literal_values_ -
> line too long (101 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 31 Jan 2019 17:09:21 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP: IMPALA-5843: Use page index in Parquet files to skip pages

Reply via email to