Qifan Chen has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out 
rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet 
table.

Currently, the entire row is materialized before filtering during a
scan. For columnar formats, we can avoid paying that cost upfront for
rows that are filtered out. Only the columns required for filtering
need to be materialized before filtering. For the rest of the columns,
materialization can be delayed and done only for rows that survive.
This patch implements this technique for the Parquet format only.
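
As a rough illustration of the idea described above (not the patch's
actual C++ code), the two-phase scan can be sketched in Python. The
function name, the dict-of-lists column layout, and the predicate
signature are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence


def scan_with_late_materialization(
    columns: Dict[str, Sequence],
    filter_columns: List[str],
    predicate: Callable[[dict], bool],
) -> List[dict]:
    """Hypothetical sketch: materialize filter columns first, the rest later."""
    num_rows = len(next(iter(columns.values()))) if columns else 0

    # Phase 1: materialize only the columns the predicate needs and
    # remember which rows survive the filter.
    survivors = []
    for i in range(num_rows):
        partial = {name: columns[name][i] for name in filter_columns}
        if predicate(partial):
            survivors.append((i, partial))

    # Phase 2: materialize the remaining columns, but only for survivors.
    other_columns = [name for name in columns if name not in filter_columns]
    rows = []
    for i, partial in survivors:
        row = dict(partial)
        for name in other_columns:
            row[name] = columns[name][i]
        rows.append(row)
    return rows
```

With a selective predicate, phase 2 touches only a small fraction of
the values in the non-filter columns, which is where the savings would
come from.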

A new configuration, 'parquet_materialization_threshold', is introduced.
It is the minimum number of consecutive filtered-out rows required to
skip materialization. Setting it to a value less than 0 disables late
materialization.
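
One way to read the threshold semantics above is as a grouping of
surviving rows into ranges, where a run of filtered-out rows is skipped
only if it is at least as long as the threshold. The sketch below is a
hypothetical Python illustration of that reading. The function name,
the boolean selection list, and the inclusive (start, end) ranges are
assumptions, not the signature of 'ScratchTupleBatch::GetMicroBatch'.

```python
from typing import List, Tuple


def micro_batches(selected: List[bool], threshold: int) -> List[Tuple[int, int]]:
    """Hypothetical sketch: inclusive (start, end) row ranges to materialize."""
    n = len(selected)
    if threshold < 0:
        # Assumed reading of "less than 0": skipping is disabled, so the
        # whole batch is treated as one range.
        return [(0, n - 1)] if n > 0 else []

    batches: List[Tuple[int, int]] = []
    start = None  # first surviving row of the current range
    last = None   # most recent surviving row seen so far
    for i, keep in enumerate(selected):
        if not keep:
            continue
        if start is None:
            start = last = i
            continue
        gap = i - last - 1  # filtered-out rows since the last survivor
        if gap >= max(threshold, 1):
            # The gap is long enough to be worth skipping: close the range.
            batches.append((start, last))
            start = i
        last = i
    if start is not None:
        batches.append((start, last))
    return batches
```

For example, with survivors at rows 0, 3, and 8 and a threshold of 3,
the two filtered rows between 0 and 3 are too few to skip, so rows 0 to
3 form one range, while the four filtered rows before row 8 are skipped.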

Performance:
Performance was measured on a single-daemon, single-threaded impalad
using the TPCH scale 42 lineitem table with 252 million rows of
unsorted data. Up to 2.5x improvement was seen for non-page-indexed
queries and up to 4x improvement with the page index. The page index
queries were borrowed from this blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added a unit test for 'ScratchTupleBatch::GetMicroBatch'
 3. Added an end-to-end test for late materialization
Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Reviewed-on: http://gerrit.cloudera.org:8080/17860
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Qifan Chen <qc...@cloudera.com>
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
A tests/query_test/test_parquet_late_materialization.py
22 files changed, 1,070 insertions(+), 52 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Qifan Chen: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 19
Gerrit-Owner: Amogh Margoor <amarg...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <amarg...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
