Yifan Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/20804 )

Change subject: IMPALA-12631: Improve count star performance for parquet scans
......................................................................


Patch Set 13:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/20804/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20804/12//COMMIT_MSG@16
PS12, Line 16: A new query option parquet_count_star_use_file_metadata is added for
             : forward compatibility. Its default value is true, if any inconsistency
             : between FileMetaData.num_rows and RowGroup.num_rows is found, we can
             : set it to false to get same results as old versions.
> Probably that would be a corrupt Parquet file. But if we are afraid of inco
Yeah. In PS13 I changed it to sum RowGroup.num_rows instead, and the single
node perf test shows the same performance improvement.

Since this no longer changes behavior, I think we do not need to introduce
the new query option. What do you think?
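
To make the comparison concrete, here is a minimal sketch of the two ways to
get the row count from a footer. It is not the patch itself: the class and
method names are hypothetical, and the row group counts are passed as a plain
array rather than read from real Parquet metadata.

```java
final class CountStarSketch {
  // Hypothetical helper: adds up the per-row-group counts, mirroring a sum
  // over RowGroup.num_rows for every row group listed in the footer.
  public static long sumRowGroupNumRows(long[] rowGroupNumRows) {
    long total = 0;
    for (long n : rowGroupNumRows) {
      total += n;
    }
    return total;
  }

  // Hypothetical check: true when the file-level count (FileMetaData.num_rows)
  // agrees with the sum over the row groups.
  public static boolean isConsistent(long fileNumRows, long[] rowGroupNumRows) {
    return fileNumRows == sumRowGroupNumRows(rowGroupNumRows);
  }

  public static void main(String[] args) {
    long[] rowGroups = {1000L, 1000L, 500L};
    System.out.println(sumRowGroupNumRows(rowGroups));   // 2500
    System.out.println(isConsistent(2500L, rowGroups));  // true
  }
}
```

Summing the row groups reads only footer metadata, so it should cost about the
same as using the file-level count while not depending on the two values
agreeing.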


http://gerrit.cloudera.org:8080/#/c/20804/12/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20804/12/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1452
PS12, Line 1452:           if (isFooterOnly) {
               :             // Only generate one scan range for footer only scans.
               :             currentOffset += remainingLength - currentLength;
               :             remainingLength = currentLength;
               :           }
> Why do we need to do this now? We didn't do that for partition key scans.
A count star optimization scan is not a zero-slot scan: it has one slot for
the num rows statistic. A partition key scan, by contrast, is a zero-slot
scan. When a scan is not zero-slot, HdfsScanner::IssueFooterRanges() creates
a footer range for every scan range, which is why we generate only one scan
range for footer-only scans here.
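
As a rough illustration of what the quoted lines do, the sketch below splits
one file into scan ranges and, for footer-only scans, jumps straight to the
last range. The surrounding loop is a simplified assumption, not the actual
HdfsScanNode code; the class name, method name, and the {offset, length}
return shape are hypothetical, and maxScanRangeLength is assumed to be
positive.

```java
import java.util.ArrayList;
import java.util.List;

final class FooterOnlyRangeSketch {
  // Returns {offset, length} pairs for one file. Simplified stand-in for the
  // planner loop that splits a file into scan ranges.
  public static List<long[]> splitFile(
      long fileLength, long maxScanRangeLength, boolean isFooterOnly) {
    List<long[]> ranges = new ArrayList<>();
    long remainingLength = fileLength;
    long currentOffset = 0;
    while (remainingLength > 0) {
      long currentLength = Math.min(remainingLength, maxScanRangeLength);
      if (isFooterOnly) {
        // Only generate one scan range for footer only scans: skip ahead so
        // this range covers the tail of the file, where the footer lives.
        currentOffset += remainingLength - currentLength;
        remainingLength = currentLength;
      }
      ranges.add(new long[] {currentOffset, currentLength});
      remainingLength -= currentLength;
      currentOffset += currentLength;
    }
    return ranges;
  }

  public static void main(String[] args) {
    // A 250-byte file with a 100-byte maximum range length.
    System.out.println(splitFile(250, 100, false).size());  // 3
    System.out.println(splitFile(250, 100, true).size());   // 1
  }
}
```

With one range per file, a scanner that issues a footer range per scan range
issues exactly one footer range per file.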


http://gerrit.cloudera.org:8080/#/c/20804/12/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/20804/12/tests/query_test/test_aggregation.py@275
PS12, Line 275:
> flake8: E501 line too long (91 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/20804/12/tests/query_test/test_aggregation.py@277
PS12, Line 277:
> flake8: E501 line too long (91 > 90 characters)
Done



--
To view, visit http://gerrit.cloudera.org:8080/20804
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
Gerrit-Change-Number: 20804
Gerrit-PatchSet: 13
Gerrit-Owner: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Zihao Ye <eyiz...@163.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Jan 2024 09:02:58 +0000
Gerrit-HasComments: Yes
