[ 
https://issues.apache.org/jira/browse/IMPALA-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-9695:
-----------------------------------
    Epic Link:   (was: IMPALA-13915)

> Support incomplete partition spec in REFRESH statement
> ------------------------------------------------------
>
>                 Key: IMPALA-9695
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9695
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Priority: Critical
>
> We support explicitly specifying a partition in the REFRESH statement. When 
> users have several partitions to refresh, they have to issue several 
> REFRESH statements. Each REFRESH statement requires the table lock, so they 
> are executed in catalogd one by one. What's worse, the table is updated 
> (its catalog version is bumped) several times, which may cause catalogd to 
> propagate it to the coordinators several times. This is bad for huge tables 
> that contain a large number of partitions: their catalog objects are huge 
> because catalogd can't send incremental updates for only the changed 
> partitions.
> A possible scenario is an hourly partitioned table that has more than one 
> level of partition keys:
> {code:sql}
> create table hourly_part_tbl (id int, msg string)
> partitioned by (hour_id bigint, event_type bigint)
> {code}
> Let's say there are 20 event_types. Every hour, 20 partitions are 
> generated with a new hour_id (one per event_type). If the retention time for 
> this table is 2 years, the total number of partitions will be 
> 2 * 365 * 24 * 20 = 350,400. 
> The catalog object size for this table will be huge, especially since there 
> will be many columns and hence incremental stats in practice.
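
As a quick check of the partition count, here is a minimal arithmetic sketch. It assumes one partition per event_type per hour (20 per hour) and 365-day years; the constant names are illustrative only.

```python
# Worked arithmetic for the partition count of hourly_part_tbl.
# Assumption: every hour produces one partition per event_type.
RETENTION_YEARS = 2
DAYS_PER_YEAR = 365
HOURS_PER_DAY = 24
EVENT_TYPES = 20

partitions_per_hour = EVENT_TYPES
total_partitions = (
    RETENTION_YEARS * DAYS_PER_YEAR * HOURS_PER_DAY * partitions_per_hour
)
print(total_partitions)  # 350400
```
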
> Every hour, users have to run 20 REFRESH statements one by one on this table. 
> The catalog server will send 20 updates to coordinators for this table. In a 
> busy cluster (with many other tables), catalogd may be constantly busy 
> loading metadata for this table.
> One possible solution is using REFRESH without the partition spec. 
> Unfortunately, we still load FileStatus for all loaded partitions. It's 
> possible that this single statement can't finish in an hour.
> Another solution is to support the REFRESH statement with an incomplete 
> partition spec, so users can use one statement:
> {code:sql}
> REFRESH hourly_part_tbl PARTITION(hour_id=xxx);
> {code}
> Then catalogd only needs to acquire the table lock once and send its catalog 
> update once.
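
To make the intended semantics concrete, here is a rough sketch of how an incomplete spec could select partitions. This is not catalogd code: `match_partial_spec`, the dict-based partition representation, and the sample hour_id values are all hypothetical.

```python
from typing import Dict, List

Partition = Dict[str, int]  # partition key name -> partition value


def match_partial_spec(partitions: List[Partition], spec: Partition) -> List[Partition]:
    """Return the partitions that agree with every key given in spec.

    Keys missing from spec (e.g. event_type) are left unconstrained, so a
    spec naming only hour_id matches all event_type partitions of that hour.
    """
    return [p for p in partitions if all(p.get(k) == v for k, v in spec.items())]


# Hypothetical table contents: 2 hours x 20 event types.
partitions = [{"hour_id": h, "event_type": e} for h in (100, 101) for e in range(20)]

# Equivalent of PARTITION(hour_id=101): one statement covers 20 partitions.
matched = match_partial_spec(partitions, {"hour_id": 101})
print(len(matched))  # 20
```

A complete spec such as `{"hour_id": 101, "event_type": 3}` still selects exactly one partition, so the existing behaviour would be a special case of the same matching rule.
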
> It'd also be useful if we supported non-equality predicates in the partition 
> spec:
> {code:sql}
> REFRESH hourly_part_tbl PARTITION(hour_id >= xxx);
> {code}
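
A rough sketch of what predicate-based selection could look like, again with hypothetical names (`OPS`, `select_partitions`) and made-up hour_id values. It only illustrates the filtering idea, not how catalogd would evaluate the spec.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Comparison operators a partition spec might allow (an assumption).
OPS: Dict[str, Callable[[int, int], bool]] = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

Partition = Dict[str, int]
Predicate = Tuple[str, str, int]  # (partition key, operator, literal)


def select_partitions(partitions: List[Partition], predicates: List[Predicate]) -> List[Partition]:
    """Return the partitions that satisfy every predicate."""
    return [
        p for p in partitions
        if all(OPS[op](p[key], literal) for key, op, literal in predicates)
    ]


# Hypothetical table contents: 4 hours x 20 event types.
partitions = [{"hour_id": h, "event_type": e} for h in range(100, 104) for e in range(20)]

# Equivalent of PARTITION(hour_id >= 102): hours 102 and 103.
selected = select_partitions(partitions, [("hour_id", ">=", 102)])
print(len(selected))  # 40
```
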



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

