Hello Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12065

to look at the new patch set (#3).

Change subject: WIP: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................

WIP: IMPALA-5843: Use page index in Parquet files to skip pages

Initial prototype of page filtering.

Some detailes are smelly, but conceptually it is getting into
shape. The read and evaluation of the page index is done by
the HdfsParquetScanner. At first, we determine the row ranges
we are interested in, and based on the row ranges we determine
the filtered pages for each column that we are reading.

We still issue one ScanRange per column chunk, but we specify
sub-ranges that store the filtered pages, i.e. we don't read
the whole column chunk, but only fractions of it.

Pages are not aligned across column chunks, i.e. page #2 of column A
might store completely different rows than page #2 of column B.
It means we need to implement some kind of row-skipping logic
when we read the data pages. This logic is implemented in
BaseScalarColumnReader and ScalarColumnReader. Collection column
readers know nothing about page filtering.

TODOs:
* fix the smelly code parts
* add counters about filtered pages
* generate files by Impala and Parquet-MR for EE tests
* test with nested types
* performance measurements

Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-level-decoder.h
A be/src/exec/parquet/parquet-page-index.cc
A be/src/exec/parquet/parquet-page-index.h
M be/src/exprs/literal.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
19 files changed, 832 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/12065/3
--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>

Reply via email to