Hello Riza Suminto, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/20804 to look at the new patch set (#4).

Change subject: IMPALA-12631: Improve count star performance for parquet scans
......................................................................

IMPALA-12631: Improve count star performance for parquet scans

Backend function HdfsParquetScanner::GetNextInternal() computes count
star from the Parquet RowGroup.num_rows field, but it still has to find
every row group and sum RowGroup.num_rows across them.

This patch uses the 'num_rows' field in the Parquet file metadata
instead. This avoids the NextRowGroup() function calls, and only one
footer range per file is generated and processed.

A new query option, parquet_count_star_use_file_metadata, is added for
forward compatibility. Its default value is true. If any inconsistency
between FileMetaData.num_rows and RowGroup.num_rows is found, the option
can be set to false to get the same results as older versions.

The following table shows a performance comparison before and after the
patch. The primitive_count_star_multiblock query is a modified
primitive_count_star query that targets a multi-block
tpch10_parquet.lineitem table. The files of the table were generated
with the command `hdfs dfs -Ddfs.block.size=1048576 -cp -f -d`.

+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| Workload          | Query                           | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| TPCDS(10)         | TPCDS-Q_COUNT_OPTIMIZED         | parquet / none / none | 0.17   | 0.16        | +2.58%     | * 29.53% * | * 27.16% *     | 30    | +1.20%         | 0.58    | 0.35   |
| TPCDS(10)         | TPCDS-Q_COUNT_UNOPTIMIZED       | parquet / none / none | 0.27   | 0.26        | +2.96%     | 8.97%      | 9.94%          | 30    | +0.16%         | 0.44    | 1.19   |
| TPCDS(10)         | TPCDS-Q_COUNT_ZERO_SLOT         | parquet / none / none | 0.18   | 0.18        | -0.69%     | 1.65%      | 1.99%          | 30    | -0.34%         | -1.55   | -1.47  |
| TARGETED-PERF(10) | primitive_count_star_multiblock | parquet / none / none | 0.06   | 0.12        | I -49.88%  | 4.11%      | 3.53%          | 30    | I -99.97%      | -6.54   | -66.81 |
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+

Testing:
- Ran PlannerTest#testParquetStatsAgg
- Ran query_test/test_aggregation.py

Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_optimized.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_unoptimized.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_zero_slot.test
10 files changed, 124 insertions(+), 21 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/20804/4
--
To view, visit http://gerrit.cloudera.org:8080/20804
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
Gerrit-Change-Number: 20804
Gerrit-PatchSet: 4
Gerrit-Owner: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Yifan Zhang <chinazhangyi...@163.com>
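
For readers skimming the change description, the two counting strategies it contrasts can be sketched roughly as below. This is an illustrative Python model, not the Impala C++ code: the RowGroup and FileMetaData classes only mirror the num_rows fields named in the commit message, and count_star() with its use_file_metadata parameter is a hypothetical stand-in for the behaviour toggled by parquet_count_star_use_file_metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RowGroup:
    # Mirrors the Parquet RowGroup.num_rows field (illustrative model only).
    num_rows: int


@dataclass
class FileMetaData:
    # Mirrors the Parquet FileMetaData.num_rows field plus its row groups.
    num_rows: int
    row_groups: List[RowGroup] = field(default_factory=list)


def count_star(footer: FileMetaData, use_file_metadata: bool = True) -> int:
    """Hypothetical helper: count the rows of one file from its footer."""
    if use_file_metadata:
        # New path: a single lookup, with no iteration over row groups.
        return footer.num_rows
    # Old path: visit every row group and sum its row count.
    return sum(rg.num_rows for rg in footer.row_groups)


# A consistent footer: both strategies agree.
footer = FileMetaData(num_rows=300, row_groups=[RowGroup(100), RowGroup(200)])

# An inconsistent footer (assumed to come from a faulty writer): the results differ.
bad_footer = FileMetaData(num_rows=5, row_groups=[RowGroup(2), RowGroup(2)])
```

With a consistent footer both paths return the same count. The second footer shows the kind of mismatch for which the commit message suggests setting the option to false.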