[ https://issues.apache.org/jira/browse/IMPALA-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Quanlong Huang updated IMPALA-9695: ----------------------------------- Epic Link: (was: IMPALA-13915) > Support incomplete partition spec in REFRESH statement > ------------------------------------------------------ > > Key: IMPALA-9695 > URL: https://issues.apache.org/jira/browse/IMPALA-9695 > Project: IMPALA > Issue Type: New Feature > Components: Catalog > Reporter: Quanlong Huang > Priority: Critical > > We support explicitly specify a partition in the REFRESH statement. When > users have several partitions to refresh, they have to trigger several > REFRESH statements. Each REFRESH statement requires the table lock so they'll > be executed in the catalogd one by one. What's worse, the table is updated > (catalog version bumped) several times, which may cause catalogd propagates > it several times to the coordinators. It's bad for huge tables that contain a > large number of partitions. Their catalog objects have huge size since > catalogd can't send incremental updates for only changed partitions. > A possible scenario is hourly partitioned tables that have more than one > level partition keys: > {code:sql} > create table hourly_part_tbl (id int, msg string) > partitioned by (hour_id bigint, event_type bigint) > {code} > Let's say there are 20 event_types. Every hour there will be 10 partitions > generated with a new hour_id. If the retention time for this table is 2 > years, the total number of partitions will be 2 * 365 * 24 * 20 = 175,200. > The catalog object size for this table wil be huge, especially there will be > many columns and hence incrementa stats in practise. > Every hour, users have to run 20 REFRESH statements one by one on this table. > The catalog server will send 20 updates to coordinators for this table. It's > possible that catalogd is always busy in loading metadata for this table in a > busy cluster (with many other tables). > One possible solution is using REFRESH without the partition spec. > Unfortunately, we still load FileStatus for all loaded partitions. It's > possible that this single statement can't finish in an hour. > Another solution is support REFRESH statement with incomplete partition spec. > So users can use one statement: > {code:java} > REFRESH hourly_part_tbl PARTITION(hour_id=xxx); > {code} > Then catalogd only needs to acquire the table lock once and send its catalog > update once. > It'd also be usefull if we support non-equality predicates in the partition > spec: > {code:sql} > REFRESH hourly_part_tbl PARTITION(hour_id >= xxx); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org