[jira] [Commented] (IMPALA-11802) Optimize count(*) queries for Iceberg V2 tables

ASF subversion and git services (Jira) Tue, 21 Feb 2023 23:41:06 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691986#comment-17691986
 ]


ASF subversion and git services commented on IMPALA-11802:
----------------------------------------------------------

Commit 3153490545d1b3730ba17bc020909f2ae9c18d94 in impala's branch 
refs/heads/master from Li Penglin
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=315349054 ]

IMPALA-11802: Optimize count(*) queries for Iceberg V2 position delete tables

The SCAN plan of a count star query for Iceberg V2 position delete tables
is as follows:

    AGGREGATE
    COUNT(*)
        |
    UNION ALL
   /         \
  /           \
 /             \
SCAN all    ANTI JOIN
datafiles  /         \
without   /           \
deletes  SCAN         SCAN
         datafiles    deletes

Since Iceberg provides the number of records in each file (record_count),
we can use this to optimize a simple count star query for Iceberg V2
position delete tables. First, the number of records in all DataFiles
without corresponding DeleteFiles is calculated from the Iceberg metadata
files. The query is then rewritten as follows:

      ArithmeticExpr(ADD)
      /             \
     /               \
    /                 \
record_count       AGGREGATE
of all             COUNT(*)
datafiles              |
without            ANTI JOIN
deletes           /         \
                 /           \
                SCAN        SCAN
                datafiles   deletes
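
The rewritten plan can be illustrated with a minimal Python sketch. This is
not Impala code: the DataFile class, the count_star function and the
(file_path, pos) tuples are hypothetical stand-ins for Iceberg file metadata
and position delete records. It only shows how the two sides of the ADD
contribute to the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for an Iceberg data file entry."""
    path: str
    record_count: int  # row count taken from file metadata, no scan needed


def count_star(data_files, position_deletes):
    """Emulate the rewritten plan.

    data_files: list of DataFile
    position_deletes: iterable of (file_path, pos) delete records
    """
    # Group deleted positions by the data file they refer to.
    deleted = {}
    for path, pos in position_deletes:
        deleted.setdefault(path, set()).add(pos)

    # Left side of ADD: sum record_count of data files without deletes.
    metadata_count = sum(
        f.record_count for f in data_files if f.path not in deleted
    )

    # Right side of ADD: COUNT(*) over the ANTI JOIN of data files that
    # have deletes against their delete records.
    scanned_count = 0
    for f in data_files:
        if f.path in deleted:
            for pos in range(f.record_count):
                if pos not in deleted[f.path]:
                    scanned_count += 1

    return metadata_count + scanned_count


files = [DataFile("a", 100), DataFile("b", 50), DataFile("c", 10)]
# The duplicate ("b", 1) record is deliberate: the anti join counts the
# deleted row only once.
deletes = [("b", 0), ("b", 1), ("b", 1), ("c", 9)]
total = count_star(files, deletes)
```

In this sketch only files "b" and "c" are scanned row by row; file "a"
contributes through its record_count alone.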

Testing:
 * Existing tests
 * Added e2e tests

Change-Id: I8172c805121bf91d23fe063f806493afe2f03d41
Reviewed-on: http://gerrit.cloudera.org:8080/19494
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Zoltan Borok-Nagy <[email protected]>


> Optimize count(*) queries for Iceberg V2 tables
> -----------------------------------------------
>
>                 Key: IMPALA-11802
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11802
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Li Penglin
>            Priority: Major
>              Labels: impala-iceberg
>
> A simple {{SELECT count( * ) FROM ice_v2_tbl;}} query could be optimized.
> First we need to investigate whether the following is true:
> If a V2 table only has position delete files, then the cardinality is
> {noformat}
> Cardinality(data files) - Cardinality(delete files)
> {noformat}
> If this is true, then we can answer count( * ) queries via a query rewrite,
> similar to what we do for V1 tables (IMPALA-11279).
> If the above is not true, we can still optimize count( * ) queries by:
> {noformat}
>         SUM
>          |
>      UNION ALL
>       /     \
>      /       \
>     /         \
> COUNT(*)     COUNT(*)
>   /                \
> SCAN             ANTI JOIN
> data files         /      \
> without           /        \
> deletes       SCAN         SCAN
>               data files   delete files
>               with deletes
> {noformat}
> The SCAN operator with "data files without deletes" could benefit from the
> count( * ) optimization (it would only need to read file metadata). In the
> common case (when there are few deletes) this SCAN is responsible for scanning
> the vast majority of data files.
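
The open question in the description, whether the cardinality equals
Cardinality(data files) - Cardinality(delete files), can be probed with a
small Python sketch. It is an illustration, not a statement about what
Iceberg guarantees: the function names are hypothetical, and it assumes that
the same (file_path, pos) pair may appear in more than one delete record. If
that can happen, plain subtraction undercounts.

```python
def count_by_subtraction(data_record_counts, delete_record_counts):
    """Cardinality(data files) - Cardinality(delete files).

    data_record_counts: dict mapping data file path -> record_count
    delete_record_counts: list of record counts, one per delete file
    """
    return sum(data_record_counts.values()) - sum(delete_record_counts)


def count_exact(data_record_counts, position_deletes):
    """Subtract only distinct delete records that hit an existing row.

    position_deletes: iterable of (file_path, pos) delete records
    """
    distinct_hits = {
        (path, pos)
        for path, pos in position_deletes
        if path in data_record_counts
        and 0 <= pos < data_record_counts[path]
    }
    return sum(data_record_counts.values()) - len(distinct_hits)


data = {"a": 100, "b": 50}

# Case 1: every delete record removes a different row, so both agree.
unique_deletes = [("b", 0), ("b", 1)]
agree = (
    count_by_subtraction(data, [len(unique_deletes)]),
    count_exact(data, unique_deletes),
)

# Case 2 (assumed possible): the row ("b", 1) is deleted twice, so
# subtraction removes it twice and the two results differ.
dup_deletes = [("b", 0), ("b", 1), ("b", 1)]
differ = (
    count_by_subtraction(data, [len(dup_deletes)]),
    count_exact(data, dup_deletes),
)
```

When the two results can differ, the UNION ALL plan in the description is the
safer choice, because its anti join handles duplicate delete records.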



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
