Zoltan Borok-Nagy has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/20804 )
Change subject: IMPALA-12631: Improve count star performance for parquet scans
......................................................................

IMPALA-12631: Improve count star performance for parquet scans

Before this patch, the frontend generated multiple scan ranges for a
parquet file when count star optimization was enabled. The backend
function HdfsParquetScanner::GetNextInternal() also called
NextRowGroup() multiple times to find row groups and sum up
RowGroup.num_rows. This could be inefficient because we only need to
read file metadata to compute count star. This patch optimizes it by
creating only one scan range that contains the file footer for each
parquet file.

The following table shows a performance comparison before and after
the patch. The primitive_count_star_multiblock query is a modified
primitive_count_star query that targets a multi-block
tpch10_parquet.lineitem table. The files of the table are generated by
the command `hdfs dfs -Ddfs.block.size=1048576 -cp -f -d`.

+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| Workload          | Query                           | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| TPCDS(10)         | TPCDS-Q_COUNT_OPTIMIZED         | parquet / none / none | 0.17   | 0.16        | +2.58%     | * 29.53% * | * 27.16% *     | 30    | +1.20%         | 0.58    | 0.35   |
| TPCDS(10)         | TPCDS-Q_COUNT_UNOPTIMIZED       | parquet / none / none | 0.27   | 0.26        | +2.96%     | 8.97%      | 9.94%          | 30    | +0.16%         | 0.44    | 1.19   |
| TPCDS(10)         | TPCDS-Q_COUNT_ZERO_SLOT         | parquet / none / none | 0.18   | 0.18        | -0.69%     | 1.65%      | 1.99%          | 30    | -0.34%         | -1.55   | -1.47  |
| TARGETED-PERF(10) | primitive_count_star_multiblock | parquet / none / none | 0.06   | 0.12        | I -49.88%  | 4.11%      | 3.53%          | 30    | I -99.97%      | -6.54   | -66.81 |
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+

Testing:
- Ran PlannerTest#testParquetStatsAgg
- Added new test cases to query_test/test_aggregation.py

Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
Reviewed-on: http://gerrit.cloudera.org:8080/20804
Reviewed-by: Zoltan Borok-Nagy <borokna...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/hdfs-tiny-scan.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_optimized.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_unoptimized.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_zero_slot.test
M tests/util/parse_util.py
11 files changed, 138 insertions(+), 63 deletions(-)

Approvals:
  Zoltan Borok-Nagy: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/20804
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
Gerrit-Change-Number: 20804
Gerrit-PatchSet: 16
Gerrit-Owner: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Zihao Ye <eyiz...@163.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
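
Illustration (not part of the patch): the idea in the commit message, one footer-only scan range per file instead of one range per block, might be sketched in Python roughly as below. Every name here (ScanRange, block_scan_ranges, footer_scan_range, count_star_from_footer) is hypothetical and does not come from the Impala code, and the FOOTER_SIZE value is an assumed read size chosen for the example. The sketch only models the range arithmetic and the summing of per-row-group row counts; it does not parse a real parquet footer.

```python
from dataclasses import dataclass
from typing import List

# Assumed footer read size, for illustration only (not taken from the patch).
FOOTER_SIZE = 100 * 1024


@dataclass
class ScanRange:
    """A byte range of a file: [offset, offset + length)."""
    offset: int
    length: int


def block_scan_ranges(file_length: int, block_size: int) -> List[ScanRange]:
    """Old behaviour as described in the commit message: one range per block."""
    return [
        ScanRange(off, min(block_size, file_length - off))
        for off in range(0, file_length, block_size)
    ]


def footer_scan_range(file_length: int, footer_size: int = FOOTER_SIZE) -> ScanRange:
    """New behaviour: a single range covering the tail of the file,
    where the parquet footer (file metadata) is stored."""
    length = min(file_length, footer_size)
    return ScanRange(file_length - length, length)


def count_star_from_footer(row_group_num_rows: List[int]) -> int:
    """count(*) is the sum of num_rows over the row groups listed in the footer."""
    return sum(row_group_num_rows)


if __name__ == "__main__":
    # A file split into 1 MiB blocks, as in the multiblock benchmark setup.
    print(block_scan_ranges(2_500_000, 1_048_576))
    print(footer_scan_range(2_500_000))
    print(count_star_from_footer([100, 250, 50]))
```

With a 1 MiB block size, a multi-block file yields several block ranges under the old scheme but only a single footer range under the new one, which is consistent with the improvement reported for primitive_count_star_multiblock.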