Amogh Margoor has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/17860 )
Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table. ...................................................................... IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table. Currently, entire row is materialized before filtering during scan. Instead of paying the cost of materializing upfront, for columnar formats we can avoid doing it for rows that are filtered out. Columns that are required for filtering are the only ones that are needed to be materialized before filtering. For rest of the columns, materialization can be delayed and be done only for rows that survive. This patch implements this technique for Parquet format only. New configuration 'parquet_materialization_threshold' is introduced, which is minimum number of consecutive rows that are filtered out to avoid materialization. If set to less that 0, it disables the late materialization. Performance: Peformance measured for single daemon, single threaded impalad upon TPCH scale 42 lineitem table with 252 million rows, unsorted data. Upto 2.5x improvement for non-page indexed and upto 4x improvement in page index seen. Queries for page index borrowed from blog: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/ More details: https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing Testing: TBD Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-columnar-scanner-ir.cc M be/src/exec/hdfs-columnar-scanner.cc M be/src/exec/hdfs-columnar-scanner.h M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h M be/src/exec/parquet/parquet-collection-column-reader.cc M be/src/exec/parquet/parquet-collection-column-reader.h M be/src/exec/parquet/parquet-column-chunk-reader.cc M be/src/exec/parquet/parquet-column-chunk-reader.h M be/src/exec/parquet/parquet-column-readers.cc M be/src/exec/parquet/parquet-column-readers.h A be/src/exec/scratch-tuple-batch-test.cc M be/src/exec/scratch-tuple-batch.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/tuple-row-compare.h M common/thrift/ImpalaService.thrift M common/thrift/Query.thrift 19 files changed, 897 insertions(+), 91 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/6 -- To view, visit http://gerrit.cloudera.org:8080/17860 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60 Gerrit-Change-Number: 17860 Gerrit-PatchSet: 6 Gerrit-Owner: Amogh Margoor <amarg...@gmail.com> Gerrit-Reviewer: Amogh Margoor <amarg...@gmail.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>