[jira] [Commented] (IMPALA-11802) Optimize count(*) queries for Iceberg V2 tables

Jira Tue, 03 Jan 2023 09:52:04 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654121#comment-17654121
 ]


Zoltán Borók-Nagy commented on IMPALA-11802:
--------------------------------------------

I think we should assume that Cardinality(table) is not necessarily equal to 
Cardinality(data files) - Cardinality(delete files).

Because
 * Concurrent deletes might create delete files that reference the same rows:
 ** 
[https://github.com/apache/iceberg/blob/cecb10bb8ab0458fb3f6a650692a8e432f08cbd2/api/src/main/java/org/apache/iceberg/RowDelta.java#L131-L133]
 * Partial compactions, e.g.:
 ## Table has data files: A, B, X and delete file: D
 ## D references A and B
 ## Now we rewrite the small files, A and X, into a new file AX'
 ## So now the table has data files AX', B, and delete file D
 ## In this case it's clear that numRows(table) is not equal to 
numRows(dataFiles) - numRows(deleteFiles), because D still contains the 
entries for A even though those rows are already gone from AX'
 ## (Though the above could be fixed by rewriting the delete file to D' to only 
reference rows in B. AFAICT Iceberg does not do that.)
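
Both cases above can be reproduced with a toy model. This is only an illustrative sketch, not Iceberg or Impala code: it assumes a data file is just a row count and a position delete file is a list of (data file name, row position) entries. The row counts and the helper names naive_count and actual_count are made up for the example.

```python
# Illustrative model only (not Iceberg or Impala code):
# a data file maps to its row count, and a position delete file is a
# list of (data_file_name, row_position) entries.

def naive_count(data_files, delete_files):
    """numRows(dataFiles) - numRows(deleteFiles), using metadata only."""
    total_rows = sum(data_files.values())
    total_deletes = sum(len(entries) for entries in delete_files.values())
    return total_rows - total_deletes


def actual_count(data_files, delete_files):
    """Count rows that survive after applying every position delete."""
    deleted = set()
    for entries in delete_files.values():
        deleted.update(entries)  # duplicates across delete files collapse here
    return sum(
        1
        for name, num_rows in data_files.items()
        for pos in range(num_rows)
        if (name, pos) not in deleted
    )


# Case 1: two concurrent deletes reference the same row of A.
concurrent_data = {"A": 3}
concurrent_deletes = {"D1": [("A", 0)], "D2": [("A", 0)]}

# Case 2: partial compaction. D originally deleted one row in A and one in B.
# A (3 rows, 1 deleted) and X (2 rows) were rewritten into AX' (4 live rows),
# but D still carries its stale entry for A.
compacted_data = {"AX'": 4, "B": 3}
compacted_deletes = {"D": [("A", 0), ("B", 1)]}

for label, data, deletes in [
    ("concurrent deletes", concurrent_data, concurrent_deletes),
    ("partial compaction", compacted_data, compacted_deletes),
]:
    print(label, naive_count(data, deletes), actual_count(data, deletes))
```

In both cases the naive formula undercounts (1 vs. 2 rows, and 5 vs. 6 rows), so it cannot be used as an exact answer for count( * ).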

> Optimize count(*) queries for Iceberg V2 tables
> -----------------------------------------------
>
>                 Key: IMPALA-11802
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11802
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> Simple {{SELECT count( * ) FROM ice_v2_tbl;}} could be optimized.
> First we need to investigate whether the following is true:
> If a V2 table only has position delete files, then the cardinality is
> {noformat}
> Cardinality(data files) - Cardinality(delete files)
> {noformat}
> If this is true, then we can answer count( * ) queries via a query rewrite, 
> similar to what we do for V1 tables (IMPALA-11279).
> If the above is not true, we can still optimize count( * ) queries by:
> {noformat}
>         SUM
>          |
>      UNION ALL
>       /     \
>      /       \
>     /         \
> COUNT(*)     COUNT(*)
>   /                \
> SCAN             ANTI JOIN
> data files         /      \
> without           /        \
> deletes       SCAN         SCAN
>               data files   delete files
>               with deletes
> {noformat}
> The SCAN operator over "data files without deletes" could benefit from the 
> count( * ) optimization (it would only need to read file metadata). In the 
> common case (when there are few deletes), this SCAN covers the vast majority 
> of data files.
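
The fallback plan quoted above can be sketched in a few lines. This is a hedged illustration of the plan shape only, not how Impala implements it: the function count_star and its inputs are hypothetical, data files are reduced to their metadata row counts, and "scanning" a file is modeled as iterating over its row positions.

```python
# Hypothetical sketch of the SUM / UNION ALL plan (not Impala code).

def count_star(record_counts, position_deletes):
    """record_counts: data file name -> row count taken from file metadata.
    position_deletes: iterable of (data_file_name, row_position) entries,
    possibly containing duplicates or entries for files that no longer exist.
    """
    deleted = set(position_deletes)
    # Only files that are still live and referenced by a delete need a scan.
    files_with_deletes = {name for name, _ in deleted if name in record_counts}

    # Left branch: files without deletes are answered from metadata alone.
    metadata_count = sum(
        num_rows
        for name, num_rows in record_counts.items()
        if name not in files_with_deletes
    )

    # Right branch: ANTI JOIN of the rows in files with deletes
    # against the delete entries.
    anti_join_count = sum(
        1
        for name in files_with_deletes
        for pos in range(record_counts[name])
        if (name, pos) not in deleted
    )

    # SUM over the UNION ALL of both branches.
    return metadata_count + anti_join_count


# The partial-compaction table from the comment: AX' has no live deletes,
# B has one deleted row (recorded twice), and the entry for A is stale.
total = count_star({"AX'": 4, "B": 3}, [("A", 0), ("B", 1), ("B", 1)])
print(total)
```

Here only B has to be scanned; AX' contributes its metadata row count directly, and the stale and duplicate delete entries do not affect the result.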



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

