Yifan Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/20804 )
Change subject: IMPALA-12631: Improve count star performance for parquet scans
......................................................................


Patch Set 13:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/20804/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20804/12//COMMIT_MSG@16
PS12, Line 16: A new query option parquet_count_star_use_file_metadata is added for
             : forward compatibility. Its default value is true, if any inconsistency
             : between FileMetaData.num_rows and RowGroup.num_rows is found, we can
             : set it to false to get same results as old versions.
> Probably that would be a corrupt Parquet file. But if we are afraid of inco
Yeah. In PS13 I changed it to sum RowGroup.num_rows instead, and the single node perf test shows the same performance improvement. Since that means no behavior changes are made, I think we no longer need to introduce this new query option. What do you think?


http://gerrit.cloudera.org:8080/#/c/20804/12/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/20804/12/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1452
PS12, Line 1452: if (isFooterOnly) {
               :   // Only generate one scan range for footer only scans.
               :   currentOffset += remainingLength - currentLength;
               :   remainingLength = currentLength;
               : }
> Why do we need to this now? We didn't do that for partition key scans.
A count star optimization scan is not a zero-slot scan: it has one slot for the num rows statistic. A partition key scan, by contrast, is a zero-slot scan. For any scan that is not zero-slot, HdfsScanner::IssueFooterRanges() creates a footer range for every scan range, which is why we generate only one scan range for footer-only scans here.

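To illustrate the arithmetic in the quoted snippet, here is a minimal sketch in Python rather than the planner's Java. It is not the actual HdfsScanNode code: the surrounding split loop, the function name split_ranges, and the parameter max_range_len are assumptions made for illustration. Only the two assignments inside the footer-only branch mirror the quoted lines.

```python
def split_ranges(offset, length, max_range_len, footer_only=False):
    """Split the byte range [offset, offset + length) into scan ranges.

    Hypothetical helper: the loop structure is assumed, not taken from
    HdfsScanNode.java. Only the footer_only branch follows the quoted patch.
    """
    ranges = []
    current_offset = offset
    remaining_length = length
    while remaining_length > 0:
        # Cap each range at max_range_len (0 means "no limit" in this sketch).
        current_length = remaining_length
        if max_range_len > 0 and remaining_length > max_range_len:
            current_length = max_range_len
        if footer_only:
            # Mirrors the quoted lines: jump to the tail of the file so that
            # the single generated range is the one containing the footer.
            current_offset += remaining_length - current_length
            remaining_length = current_length
        ranges.append((current_offset, current_length))
        remaining_length -= current_length
        current_offset += current_length
    return ranges


regular = split_ranges(0, 100, 30)
footer = split_ranges(0, 100, 30, footer_only=True)
print(regular)
print(footer)
```

With a 100-byte file and a 30-byte limit, the regular path yields four ranges, while the footer-only path yields a single range covering the last 30 bytes, so only one footer range would be issued for the file.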
http://gerrit.cloudera.org:8080/#/c/20804/12/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/20804/12/tests/query_test/test_aggregation.py@275
PS12, Line 275:
> flake8: E501 line too long (91 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/20804/12/tests/query_test/test_aggregation.py@277
PS12, Line 277:
> flake8: E501 line too long (91 > 90 characters)
Done



--
To view, visit http://gerrit.cloudera.org:8080/20804
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib9cd2448fe51a420d4559d0cc861c4d30822f4fd
Gerrit-Change-Number: 20804
Gerrit-PatchSet: 13
Gerrit-Owner: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Yifan Zhang <chinazhangyi...@163.com>
Gerrit-Reviewer: Zihao Ye <eyiz...@163.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Jan 2024 09:02:58 +0000
Gerrit-HasComments: Yes