[ https://issues.apache.org/jira/browse/IMPALA-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654121#comment-17654121 ]
Zoltán Borók-Nagy commented on IMPALA-11802:
--------------------------------------------

I think we should assume that Cardinality(table) is not equal to Cardinality(data files) - Cardinality(delete files), because:
* Concurrent deletes might create delete files that reference the same rows:
** [https://github.com/apache/iceberg/blob/cecb10bb8ab0458fb3f6a650692a8e432f08cbd2/api/src/main/java/org/apache/iceberg/RowDelta.java#L131-L133]
* Partial compactions, e.g.:
## Table has data files: A, B, X and delete file: D
## D references A and B
## Now we rewrite the small files, which are A and X
## So now the table has data files AX', B, and delete file D
## In this case it's clear that numRows(table) is not equal to numRows(dataFiles) - numRows(deleteFiles)
## (Though the above could be fixed by rewriting the delete file to D' so that it only references rows in B. AFAICT Iceberg does not do that.)

> Optimize count(*) queries for Iceberg V2 tables
> -----------------------------------------------
>
>                 Key: IMPALA-11802
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11802
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> Simple {{SELECT count( * ) FROM ice_v2_tbl;}} could be optimized.
> At first we need to investigate if the following is true:
> If a V2 table only has position delete files, then the cardinality is
> {noformat}
> Cardinality(data files) - Cardinality(delete files)
> {noformat}
> If this is true, then we can answer count( * ) queries via a query rewrite
> similarly to what we do for V1 tables: IMPALA-11279
> If the above is not true, we can still optimize count( * ) queries by:
> {noformat}
>               SUM
>                |
>            UNION ALL
>           /        \
>          /          \
>         /            \
>    COUNT(*)       COUNT(*)
>       /              \
>     SCAN          ANTI JOIN
>  data files       /     \
>   without        /       \
>   deletes      SCAN      SCAN
>            data files  delete files
>            with deletes
> {noformat}
> The SCAN operator with "data files without deletes" could benefit from
> count( * ) optimization (they would only need to read file metadata). In the
> common case (when there are few deletes) this SCAN is in charge of scanning
> the vast majority of data files.
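
The two counterexamples from the comment, and the UNION ALL plan from the issue description, can be checked with a small simulation. This is only a sketch under stated assumptions, not Impala or Iceberg code: DataFile, naive_count, exact_count and split_plan_count are hypothetical names, the row counts are made up, and it assumes that the compaction drops the already-deleted rows of A when it writes AX'. Each position delete file is modeled as a list of (data file name, row position) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str
    num_rows: int


def naive_count(data_files, delete_files):
    """Cardinality(data files) - Cardinality(delete files)."""
    return sum(f.num_rows for f in data_files) - sum(len(d) for d in delete_files)


def effective_deletes(data_files, delete_files):
    """Distinct delete records that point at a row of a live data file."""
    rows = {f.name: f.num_rows for f in data_files}
    return {
        (name, pos)
        for d in delete_files
        for (name, pos) in d
        if name in rows and 0 <= pos < rows[name]
    }


def exact_count(data_files, delete_files):
    """Row count after duplicate and dangling delete records are ignored."""
    deleted = effective_deletes(data_files, delete_files)
    return sum(f.num_rows for f in data_files) - len(deleted)


def split_plan_count(data_files, delete_files):
    """Mimics the SUM over UNION ALL plan from the issue description."""
    deleted = effective_deletes(data_files, delete_files)
    with_deletes = {name for (name, _) in deleted}
    # Left branch: files without deletes, answered from file metadata only.
    metadata_branch = sum(
        f.num_rows for f in data_files if f.name not in with_deletes
    )
    # Right branch: anti join of the remaining data rows against delete records.
    anti_join_branch = sum(
        1
        for f in data_files
        if f.name in with_deletes
        for pos in range(f.num_rows)
        if (f.name, pos) not in deleted
    )
    return metadata_branch + anti_join_branch


# Case 1: two concurrent deletes both remove row 1 of file A.
concurrent_files = [DataFile("A", 10)]
concurrent_deletes = [[("A", 0), ("A", 1)], [("A", 1), ("A", 2)]]

# Case 2: partial compaction. D deleted rows 0 and 1 of A and row 0 of B.
# A (10 rows, 2 deleted) and X (5 rows) were rewritten into AX' (13 rows),
# so D's records for A are now dangling.
compacted_files = [DataFile("AX'", 13), DataFile("B", 10)]
compacted_deletes = [[("A", 0), ("A", 1), ("B", 0)]]

for files, deletes in [
    (concurrent_files, concurrent_deletes),
    (compacted_files, compacted_deletes),
]:
    print(
        naive_count(files, deletes),
        exact_count(files, deletes),
        split_plan_count(files, deletes),
    )
# prints "6 7 7" and then "20 22 22"
```

In both cases the subtraction undercounts the table, while the split plan agrees with the exact count and still answers the files without deletes from metadata alone.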