Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22873 )

Change subject: IMPALA-14014: Fix COMPUTE STATS with TABLESAMPLE clause
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@830
PS6, Line 830: Map<Long, List<FileDescriptor>> sample = 
feFsTable.getFilesSample(
             :         samplePerc, minSampleBytes, sampleSeed);
> Do I understand it correctly that we calculate the file samples twice? Firs
Yes, in case of COMPUTE STATS, we are doing it twice, first here, then in the 
SCAN plan nodes. IIUC, COMPUTE STATS need to calculate the sampling to set the 
value of 'effectiveSamplePerc_' which is used in stats extrapolation to get 
more precise stats.

In other cases, e.g. SELECT * FROM t TABLESAMPLE SYSTEM(10); we only calculate 
the sampling in the SCAN plan nodes.

We could probably improve this, to always calculate the file samples once, 
during analysis, though that would be a non-trivial change. Let's discuss it 
separately under a different ticket. Would you mind filing one?


http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java@364
PS6, Line 364: // We don't sample delete files (for correctness), let's add all 
of them to
             :     // the merged result.
> AFAIK a delete file can reference multiple data files. Is it possible that
Great observation! I extended the cardinality estimation logic in 
IcebergDeleteNode.computeStats() to take the sampling into account. Also added 
a new query to tablesample.test.


http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java
File fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java:

http://gerrit.cloudera.org:8080/#/c/22873/6/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@33
PS6, Line 33:
            : import 
org.apache.curator.shaded.com.google.common.base.Preconditions;
> Ack
Done



--
To view, visit http://gerrit.cloudera.org:8080/22873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie59d5fc1374ab69209a74f2488bcb9a7d510b782
Gerrit-Change-Number: 22873
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 03 Jun 2025 11:12:55 +0000
Gerrit-HasComments: Yes

Reply via email to