Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 12: Code-Review+1 (5 comments) http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/hdfs-parquet-scanner.cc@618 PS11, Line 618: // If the row group overlaps with the optional: This loop grew really large, > 100 lines, doesn't even fit to my huge monitor. :) At this point it was decided that the row group belongs to the scanner. The rest could be extracted to functions like Status ApplyStatFilters(const parquet::RowGroup& row_group, bool* row_group_skipped) and Status ScanDictioneries(const parquet::RowGroup& row_group, bool* row_group_skipped) http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h File be/src/exec/parquet/parquet-page-index.h: http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h@39 PS11, Line 39: /// It reads the raw bytes of the whole page index and stores it in an : /// internal buffer. : /// It doesn't expect that the Page index in a particular layout, it only : /// expects that the whole page index layed out continuously in the file. : /// It needs to be called before the serialization methods. : Status ReadAll(); Can you mention that this reads the page index for all row groups? For me this was not intuitive. The function's name could also reflect this, as "ReadAll" could also mean reading it for all column chunks in the row group. http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h@61 PS11, Line 61: /// Common helper for deserialization nit: other member comments have . at the end http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h@65 PS11, Line 65: /// The scanner that created this object nit: other member comments have . at the end http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc File be/src/exec/parquet/parquet-page-index.cc: http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc@81 PS11, Line 81: > 1. The ScanNode creates and deletes scanners for the file splits. This mean Sorry for the wrong assumptions, I got lost somewhere in HdfsParquetScanner::NextRowGroup(). I still think though that we hold onto memory that we do not want to use later. Also, theoretically there can be huge Parquet files with lot of row groups, so the complete page index can be arbitrarily large. Would it be problematic to read the page index for every row group separately during ProcessPageIndex()? This would mean that if the row group is skipped because of row group level stats or because of it is out of the split (the latter is the normal case for multi row group files I guess), then we would avoid reading it's page index. Regardless, for the single row group case, releasing the buffer if we are reading the last row group would be enough. This would avoid the memory wasting for Parquet files written by Impala. I am ok with dealing with this in another patch. -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 12 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 16 Apr 2019 14:01:25 +0000 Gerrit-HasComments: Yes