Hello Michael Ho, Lars Volker, Pooja Nilangekar, Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12065

to look at the new patch set (#7).

Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................

IMPALA-5843: Use page index in Parquet files to skip pages

This commit implements page filtering based on the Parquet page index.

The HdfsParquetScanner reads and evaluates the page index. First, we
determine the row ranges we are interested in, and based on the row
ranges we determine the candidate pages for each column that we are
reading. We still issue one ScanRange per column chunk, but we specify
sub-ranges that store the candidate pages, i.e. we don't read the whole
column chunk, only fractions of it.

Pages are not aligned across column chunks, i.e. page #2 of column A
might store completely different rows than page #2 of column B. This
means we need row-skipping logic when we read the data pages. This logic
is implemented in BaseScalarColumnReader and ScalarColumnReader.
Collection column readers know nothing about page filtering.

Page filtering can be turned off by setting the query option
'read_parquet_page_index' to false.

Testing:
* added some unit tests for the row range and page selection logic
* generated various Parquet files with Parquet-MR
* enabled page index writing and wrote selective queries against tables
  written by Impala. Current tests are likely to use page filtering
  transparently.

Performance:
* measured locally, observed 3x to 10x speedup for selective queries

TODO:
* run standard benchmarks
* measure performance for remote reads

Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
---
M be/src/common/global-flags.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
A be/src/exec/parquet/parquet-common-test.cc
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-level-decoder.h
A be/src/exec/parquet/parquet-page-index.cc
A be/src/exec/parquet/parquet-page-index.h
M be/src/exprs/literal.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/dict-encoding.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/data/README
A testdata/data/alltypes_tiny_pages.parquet
A testdata/data/decimals_1_10.parquet
A testdata/data/double_nested_decimals.parquet
A testdata/data/nested_decimals.parquet
A testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-large.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test
M testdata/workloads/functional-query/queries/QueryTest/stats-extrapolation.test
M tests/query_test/test_parquet_stats.py
33 files changed, 2,662 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/12065/7
--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
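
Illustration (not part of the patch): the commit message describes
mapping row ranges to candidate pages per column, and skipping rows
because pages are not aligned across columns. The sketch below shows one
way this could look, assuming each page's first row index is known (the
Parquet offset index records a first row index per page). RowRange,
SelectCandidatePages and RowsToSkip are hypothetical names chosen for
this example; they are not taken from the Impala sources, and the actual
implementation may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: an inclusive range of row indexes within a row group.
struct RowRange {
  int64_t first;
  int64_t last;
};

// Returns the indexes of the pages that overlap at least one row range.
// 'page_first_rows[i]' is the index of the first row stored in page i
// (ascending). 'num_rows' is the row count of the row group. 'ranges' is
// assumed to be sorted and non-overlapping.
std::vector<int> SelectCandidatePages(
    const std::vector<int64_t>& page_first_rows, int64_t num_rows,
    const std::vector<RowRange>& ranges) {
  std::vector<int> pages;
  std::size_t r = 0;
  for (std::size_t p = 0; p < page_first_rows.size(); ++p) {
    const int64_t page_first = page_first_rows[p];
    // A page ends right before the next page starts (or at the last row).
    const int64_t page_last =
        (p + 1 < page_first_rows.size() ? page_first_rows[p + 1] : num_rows) - 1;
    // Drop ranges that end before this page begins.
    while (r < ranges.size() && ranges[r].last < page_first) ++r;
    if (r == ranges.size()) break;
    if (ranges[r].first <= page_last) pages.push_back(static_cast<int>(p));
  }
  return pages;
}

// Number of leading rows to skip in a candidate page before the first row
// of 'range' is reached. Zero if the range starts at or before the page.
int64_t RowsToSkip(int64_t page_first_row, const RowRange& range) {
  assert(range.first <= range.last);
  return std::max<int64_t>(0, range.first - page_first_row);
}
```

For the row range [150, 160] in a 300-row row group, a column whose pages
start at rows {0, 100, 200} yields candidate page 1, while a column whose
pages start at rows {0, 50, 120, 155} yields candidate pages 2 and 3. The
first column's reader would have to skip 50 rows in its candidate page,
the second column's reader 30 rows in page 2 and none in page 3.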