Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/22873 )
Change subject: IMPALA-14014: Fix COMPUTE STATS with TABLESAMPLE clause ...................................................................... Patch Set 6: (3 comments) http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java: http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@830 PS6, Line 830: Map<Long, List<FileDescriptor>> sample = feFsTable.getFilesSample( : samplePerc, minSampleBytes, sampleSeed); > Do I understand it correctly that we calculate the file samples twice? Firs Yes, in case of COMPUTE STATS, we are doing it twice, first here, then in the SCAN plan nodes. IIUC, COMPUTE STATS need to calculate the sampling to set the value of 'effectiveSamplePerc_' which is used in stats extrapolation to get more precise stats. In other cases, e.g. SELECT * FROM t TABLESAMPLE SYSTEM(10); we only calculate the sampling in the SCAN plan nodes. We could probably improve this, to always calculate the file samples once, during analysis, though that would be a non-trivial change. Let's discuss it separately under a different ticket. Would you mind filing one? http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java File fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java: http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java@364 PS6, Line 364: // We don't sample delete files (for correctness), let's add all of them to : // the merged result. > AFAIK a delete file can reference multiple data files. Is it possible that Great observation! I extended the cardinality estimation logic in IcebergDeleteNode.computeStats() to take the sampling into account. Also added a new query to tablesample.test. http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java File fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java: http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@33 PS6, Line 33: : import org.apache.curator.shaded.com.google.common.base.Preconditions; > Ack Done -- To view, visit http://gerrit.cloudera.org:8080/22873 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie59d5fc1374ab69209a74f2488bcb9a7d510b782 Gerrit-Change-Number: 22873 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 03 Jun 2025 11:12:55 +0000 Gerrit-HasComments: Yes
