Hello Riza Suminto, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/20804

to look at the new patch set (#4).

Change subject: IMPALA-12631: Improve count star performance for parquet scans
......................................................................

IMPALA-12631: Improve count star performance for parquet scans

Backend function HdfsParquetScanner::GetNextInternal() uses the data
stored in the Parquet RowGroup.num_rows field to compute count star,
but it still needs to find the row groups and sum all RowGroup.num_rows
values. This patch uses the 'num_rows' field in the Parquet file
metadata instead, which avoids NextRowGroup() function calls and
generates and processes only one footer range per file.

A new query option, parquet_count_star_use_file_metadata, is added for
forward compatibility. Its default value is true. If any inconsistency
between FileMetaData.num_rows and RowGroup.num_rows is found, the
option can be set to false to get the same results as older versions.
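
As a rough illustration of the two counting strategies and the option
that selects between them, here is a minimal sketch. It is not the code
in hdfs-parquet-scanner.cc: the RowGroup and FileMetaData structs are
simplified stand-ins for the Parquet footer types, and the function
names and the use_file_metadata parameter are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the Parquet footer structs.
// Only the fields mentioned in the commit message are modelled.
struct RowGroup {
  int64_t num_rows;
};

struct FileMetaData {
  int64_t num_rows;
  std::vector<RowGroup> row_groups;
};

// Old behaviour (sketch): visit every row group and sum its num_rows.
int64_t CountStarFromRowGroups(const FileMetaData& md) {
  int64_t total = 0;
  for (const RowGroup& rg : md.row_groups) {
    assert(rg.num_rows >= 0);  // a negative count would indicate a bad footer
    total += rg.num_rows;
  }
  return total;
}

// New behaviour (sketch): trust the file-level num_rows in the footer.
int64_t CountStarFromFileMetadata(const FileMetaData& md) {
  assert(md.num_rows >= 0);
  return md.num_rows;
}

// Hypothetical dispatcher: use_file_metadata plays the role of the
// parquet_count_star_use_file_metadata query option.
int64_t CountStar(const FileMetaData& md, bool use_file_metadata) {
  return use_file_metadata ? CountStarFromFileMetadata(md)
                           : CountStarFromRowGroups(md);
}
```

For a well-formed file both paths return the same count. They differ
only when the footer's FileMetaData.num_rows disagrees with the sum of
RowGroup.num_rows, which is the case the option is meant to cover.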

The following table shows a performance comparison before and after
the patch. The primitive_count_star_multiblock query is a modified
primitive_count_star query that targets a multi-block
tpch10_parquet.lineitem table. The files of the table are generated
by the command `hdfs dfs -Ddfs.block.size=1048576 -cp -f -d`.

+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| Workload          | Query                           | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| TPCDS(10)         | TPCDS-Q_COUNT_OPTIMIZED         | parquet / none / none | 0.17   | 0.16        |   +2.58%   | * 29.53% * | * 27.16% *     | 30    |   +1.20%       | 0.58    | 0.35   |
| TPCDS(10)         | TPCDS-Q_COUNT_UNOPTIMIZED       | parquet / none / none | 0.27   | 0.26        |   +2.96%   |   8.97%    |   9.94%        | 30    |   +0.16%       | 0.44    | 1.19   |
| TPCDS(10)         | TPCDS-Q_COUNT_ZERO_SLOT         | parquet / none / none | 0.18   | 0.18        |   -0.69%   |   1.65%    |   1.99%        | 30    |   -0.34%       | -1.55   | -1.47  |
| TARGETED-PERF(10) | primitive_count_star_multiblock | parquet / none / none | 0.06   | 0.12        | I -49.88%  |   4.11%    |   3.53%        | 30    | I -99.97%      | -6.54   | -66.81 |
+-------------------+---------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+

Testing:
- Ran PlannerTest#testParquetStatsAgg
- Ran query_test/test_aggregation.py

Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats-agg.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_optimized.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_unoptimized.test
A testdata/workloads/tpcds/queries/tpcds-decimal_v2-q_count_zero_slot.test
10 files changed, 124 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/20804/4
--
To view, visit http://gerrit.cloudera.org:8080/20804
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
Gerrit-Change-Number: 20804
Gerrit-PatchSet: 4
Gerrit-Owner: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Yifan Zhang <chinazhangyi...@163.com>