Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 11: (6 comments) http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@644 PS8, Line 644: // {min: 10, max: 20}, and query is 'select * from T where A = 8'. > This is not the only case when it can happen. Another example is multi key Thanks, yeah, so basically it can happen anytime when you have predicates against different columns, and the separate row ranges that pass the predicates don't have a common subset. I extended the comment. http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/hdfs-parquet-scanner.cc@618 PS11, Line 618: COUNTER_ADD(num_row_groups_counter_, 1); > optional: This loop grew really large, > 100 lines, doesn't even fit to my This change is already quite huge, I would rather not add unrelated refactorings. http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h File be/src/exec/parquet/parquet-page-index.h: http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h@39 PS11, Line 39: /// It reads the raw bytes of the whole page index and stores it in an : /// internal buffer. : /// It doesn't expect that the Page index in a particular layout, it only : /// expects that the whole page index layed out continuously in the file. : /// It needs to be called before the serialization methods. : Status ReadAll(); > Can you mention that this reads the page index for all row groups? For me t Changed this function to only read the indexes belonging to a single row group. Updated the comment. http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h@61 PS11, Line 61: /// Common helper for deserialization > nit: other member comments have . at the end Done http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h@65 PS11, Line 65: /// The scanner that created this object > nit: other member comments have . at the end Done http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc File be/src/exec/parquet/parquet-page-index.cc: http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc@81 PS11, Line 81: page_index_buffer_.TryAllocate(size) > Sorry for the wrong assumptions, I got lost somewhere in HdfsParquetScanner Made this change, now we only read the indexes of a single row group, and after evaluating it we release the allocated buffer. -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 11 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Wed, 17 Apr 2019 12:50:07 +0000 Gerrit-HasComments: Yes