Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 18: Code-Review+1 (8 comments) I mostly checked the changes that pertained to my comments on an older patch set. I'm a bit short on time currently but I also don't want to hold up this change, so I'd like to defer to Csaba for the +2. http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@464 PS17, Line 464: bool GetRequiredStatsField(const string& fn_name, I think this could go into parquet-column-stats.{h,cc}. http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@546 PS17, Line 546: ColumnStatsReader::StatsField stats_field = ColumnStatsReader::StatsField::MIN; I think we usually don't initialize output parameters to make it clear that they're expected to be set by the function they're passed into. http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@643 PS17, Line 643: // We don't need the raw page index buffers anymore. : page_index_.Release(); Can this go to ProcessPageIndex? It already touches a bunch of other state. http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-column-readers.h@400 PS17, Line 400: /// True, if we are using NextLevels() to readahead the next def and rep levels. In this I feel that this field needs more explanation. From just looking at the comment I cannot tell when one would set this, and why it is needed. http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-column-readers.h@403 PS17, Line 403: bool levels_readahead_ = false; Would it simplify the code to make this levels_read_ahead_offset_ (being -1 or 0 (or 0 and 1)) and then just add it where needed? http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-page-index-test.cc File be/src/exec/parquet/parquet-page-index-test.cc: http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-page-index-test.cc@58 PS17, Line 58: void ValidatePageIndexRange(const RowGroupRanges& row_group_ranges, Add a comment for this one, too? http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-page-index.h File be/src/exec/parquet/parquet-page-index.h: http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-page-index.h@56 PS17, Line 56: if there's at least parts of the page index are present nit: grammar http://gerrit.cloudera.org:8080/#/c/12065/17/tests/query_test/test_parquet_stats.py File tests/query_test/test_parquet_stats.py: http://gerrit.cloudera.org:8080/#/c/12065/17/tests/query_test/test_parquet_stats.py@87 PS17, Line 87: for batch_size in [0, 1]: Should we use a proper test dimension for the batch size, e.g. like in tests/query_test/test_scanners.py:168 ? -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 18 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Sun, 28 Apr 2019 22:14:41 +0000 Gerrit-HasComments: Yes