Hello Riza Suminto, Zoltan Borok-Nagy, Zihao Ye, Csaba Ringhofer,
Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/20804

to look at the new patch set (#14).

Change subject: IMPALA-12631: Improve count star performance for parquet scans
......................................................................

IMPALA-12631: Improve count star performance for parquet scans

Before this patch, the frontend generated multiple scan ranges for a
parquet file when the count star optimization was enabled. The backend
function HdfsParquetScanner::GetNextInternal() also called
NextRowGroup() multiple times to find row groups and sum up
RowGroup.num_rows. This is inefficient because only the file metadata
needs to be read to compute count star. This patch optimizes it by
creating only one scan range, which contains the file footer, for each
parquet file.
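
As a rough illustration of the idea only (not the code in this patch),
the Python sketch below contrasts one scan range per block with a single
footer scan range, and computes count star by summing num_rows over the
row groups in the footer metadata. All class and function names here are
hypothetical stand-ins, and the 100 KB footer read size is an assumption
made for the example.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical stand-ins for Parquet footer metadata; not Impala's classes.
@dataclass
class RowGroup:
    num_rows: int


@dataclass
class FileMetaData:
    row_groups: List[RowGroup]


@dataclass
class ScanRange:
    offset: int
    length: int


# Assumed footer read size, chosen for illustration only.
FOOTER_SIZE = 100 * 1024


def split_by_block(file_len: int, block_size: int) -> List[ScanRange]:
    """One scan range per block, as a multi-block file would be planned."""
    return [
        ScanRange(offset, min(block_size, file_len - offset))
        for offset in range(0, file_len, block_size)
    ]


def footer_range(file_len: int, footer_size: int = FOOTER_SIZE) -> ScanRange:
    """A single scan range covering the tail of the file, where the footer is."""
    length = min(footer_size, file_len)
    return ScanRange(file_len - length, length)


def count_star(meta: FileMetaData) -> int:
    """count(*) needs only the footer: sum num_rows over all row groups."""
    return sum(rg.num_rows for rg in meta.row_groups)
```

With a 1 MB block size, a file slightly larger than 10 MB yields 11
per-block ranges in this sketch, while the footer-only plan yields one.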

The following table shows a performance comparison before and after
the patch. The primitive_count_star_multiblock query is a modified
primitive_count_star query that targets a multi-block
tpch10_parquet.lineitem table. The files of the table were generated
with the command `hdfs dfs -Ddfs.block.size=1048576 -cp -f -d`.

+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| Workload          | Query                           | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| TPCDS(10)         | TPCDS-Q_COUNT_OPTIMIZED         | parquet / none / none | 0.17   | 0.16        |   +2.58%   | * 29.53% * | * 27.16% *     | 30    |   +1.20%       | 0.58    | 0.35   |
| TPCDS(10)         | TPCDS-Q_COUNT_UNOPTIMIZED       | parquet / none / none | 0.27   | 0.26        |   +2.96%   |   8.97%    |   9.94%        | 30    |   +0.16%       | 0.44    | 1.19   |
| TPCDS(10)         | TPCDS-Q_COUNT_ZERO_SLOT         | parquet / none / none | 0.18   | 0.18        |   -0.69%   |   1.65%    |   1.99%        | 30    |   -0.34%       | -1.55   | -1.47  |
| TARGETED-PERF(10) | primitive_count_star_multiblock | parquet / none / none | 0.06   | 0.12        | I -49.88%  |   4.11%    |   3.53%        | 30    | I -99.97%      | -6.54   | -66.81 |
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+

Testing:
- Ran PlannerTest#testParquetStatsAgg
- Added new test cases to query_test/test_aggregation.py

Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/hdfs-tiny-scan.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_optimized.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_unoptimized.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_zero_slot.test
M tests/util/parse_util.py
11 files changed, 144 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/20804/14
--
To view, visit http://gerrit.cloudera.org:8080/20804
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
Gerrit-Change-Number: 20804
Gerrit-PatchSet: 14
Gerrit-Owner: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Zihao Ye <eyiz...@163.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
