[ https://issues.apache.org/jira/browse/IMPALA-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on IMPALA-11802 started by Li Penglin.
-------------------------------------------
> Optimize count(*) queries for Iceberg V2 tables
> -----------------------------------------------
>
>                 Key: IMPALA-11802
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11802
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Li Penglin
>            Priority: Major
>              Labels: impala-iceberg
>
> Simple {{SELECT count( * ) FROM ice_v2_tbl;}} could be optimized.
> First we need to investigate whether the following is true:
> If a V2 table only has position delete files, then the cardinality is
> {noformat}
> Cardinality(data files) - Cardinality(delete files)
> {noformat}
> If this is true, then we can answer count( * ) queries via a query rewrite,
> similarly to what we do for V1 tables: IMPALA-11279
> If the above is not true, we can still optimize count( * ) queries by:
> {noformat}
>                SUM
>                 |
>             UNION ALL
>             /       \
>            /         \
>           /           \
>      COUNT(*)       COUNT(*)
>         /               \
>      SCAN            ANTI JOIN
>   data files          /     \
>    without           /       \
>    deletes         SCAN      SCAN
>                data files  delete files
>               with deletes
> {noformat}
> The SCAN operator with "data files without deletes" could benefit from the
> count( * ) optimization (it would only need to read file metadata). In the
> common case (when there are few deletes) this SCAN is in charge of scanning
> the vast majority of data files.
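
The two strategies described in the issue can be compared with a small sketch. This is a toy model only, not Impala or Iceberg code: the names DataFile, count_by_subtraction and count_by_plan are hypothetical, data files are reduced to a path plus a metadata row count, and a position delete is modelled as a (path, position) pair. The duplicated delete entry in the second example is an assumption chosen to show what would break the subtraction formula; whether such duplicates can occur in practice is exactly what the issue says needs to be investigated.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataFile:
    path: str
    record_count: int  # row count as recorded in the file metadata


# Hypothetical model: a position delete is a (data file path, row position) pair.
PositionDelete = Tuple[str, int]


def count_by_subtraction(data_files: List[DataFile],
                         deletes: List[PositionDelete]) -> int:
    """Cardinality(data files) - Cardinality(delete files)."""
    return sum(f.record_count for f in data_files) - len(deletes)


def count_by_plan(data_files: List[DataFile],
                  deletes: List[PositionDelete]) -> int:
    """SUM over UNION ALL of a metadata-only count and an anti-join count."""
    # The anti join matches on (path, position), so duplicate entries collapse.
    deleted = set(deletes)
    paths_with_deletes = {path for path, _ in deleted}

    # Left branch: files without deletes are counted from metadata alone.
    left = sum(f.record_count for f in data_files
               if f.path not in paths_with_deletes)

    # Right branch: files with deletes are "scanned" row by row, and rows
    # that have a matching delete entry are dropped (ANTI JOIN).
    right = sum(1
                for f in data_files if f.path in paths_with_deletes
                for pos in range(f.record_count)
                if (f.path, pos) not in deleted)
    return left + right


data_files = [DataFile("a.parquet", 100),
              DataFile("b.parquet", 50),
              DataFile("c.parquet", 10)]

# Each deleted row appears exactly once: both strategies agree.
deletes = [("c.parquet", 0), ("c.parquet", 3)]
agree = (count_by_subtraction(data_files, deletes),
         count_by_plan(data_files, deletes))

# Assumed scenario: the same row is recorded as deleted twice.
deletes_dup = deletes + [("c.parquet", 3)]
differ = (count_by_subtraction(data_files, deletes_dup),
          count_by_plan(data_files, deletes_dup))
```

In this model only c.parquet goes through the anti join; a.parquet and b.parquet, which hold most of the rows, are counted from metadata, which is the benefit the issue expects in the common case of few deletes.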