[ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-21761 started by Sankar Hariappan.
-----------------------------------------------
> Support table level replication in Hive
> ---------------------------------------
>
>                 Key: HIVE-21761
>                 URL: https://issues.apache.org/jira/browse/HIVE-21761
>             Project: Hive
>          Issue Type: New Feature
>          Components: repl
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>            Priority: Major
>              Labels: DR, Replication
>
> *Requirements:*
> {code}
> - Users need to define a replication policy that replicates only specific 
> tables. This lets them replicate just the business-critical tables instead of 
> all tables, which may strain network bandwidth and storage and slow down 
> Hive replication.
> - Users need to define the replication policy using regular expressions (such 
> as db.sales_*), include additional tables that do not match the pattern, and 
> exclude some tables that do match it.
> - Users need to add or remove tables dynamically by manually changing the 
> replication policy at run time.
> {code}
> *Design:*
> {code}
> 1. Hive continues to support the DB level replication policy of format 
> <db_name>.* but, logically, treats the policy as <db_name>.(t1, t3, …).
> 2. A regular expression can also be used as the replication policy. For 
> example,
>   a. <db_name>.<prefix*>, 
>   b. <db_name>.<*suffix>, 
>   c. <db_name>.<prefix*suffix>. 
> 3. If a regular expression is provided as the replication policy, then Hive 
> also accepts include and exclude lists as input, which also helps to 
> dynamically add/remove tables for replication.
>   a. The exclude list specifies the tables to be excluded even if they 
> satisfy the regular expression.
>   b. The include list specifies the tables to be included in addition to the 
> tables satisfying the regular expression.
> 4. The new format for the replication policy has 3 parts, all separated by a 
> dot (.).
>   a. The first part is the DB name.
>   b. The second part is the include list: comma-separated table names/regex 
> within square brackets [].
>   c. The third part is the exclude list: comma-separated table names/regex 
> within square brackets [].
>     - <db_name>   -- Full DB replication
>     - <db_name>.*    -- Full DB replication
>     - <db_name>.[t1, t3]  -- DB replication with static list of tables t1 and 
> t3 included.
>     - <db_name>.[t1*, t2].[t100] -- DB replication with all tables having 
> prefix t1 and also include table t2 which doesn’t have prefix t1 and exclude 
> t100 which has the prefix t1.
> 5. If the DB property “repl.source.for” is set, then by default all the 
> tables in the DB will be enabled for replication and will continue to archive 
> deleted data to CM path.
> 6. REPL DUMP takes 2 inputs along with the existing FROM and WITH clauses.
>   a. REPL DUMP <current_repl_policy> [REPLACE <previous_repl_policy>] FROM 
> <last_repl_id> WITH <key_values_list>;
> current_repl_policy and previous_repl_policy can be in any format mentioned 
> in Point-4.
>   b. The REPLACE clause is to be supported to take the previous repl policy 
> as input.
>   c. The rest of the format remains the same.
> 7. Now, REPL DUMP on this DB will replicate the tables based on 
> current_repl_policy.
> 8. Any table that is added dynamically, either due to a change in the 
> regular expression or by being added to the include list, should be 
> bootstrapped.
>   a. Hive will automatically figure out the list of newly included tables by 
> comparing the current_repl_policy & previous_repl_policy inputs and combine 
> the bootstrap dump for the added tables with the incremental dump. As we can 
> combine the first incremental with the bootstrap dump, it removes the current 
> limitation of the target DB being inconsistent after bootstrap unless we run 
> the first incremental replication.
>   b. If any table is renamed, then it may get dynamically added/removed for 
> replication based on the defined replication policy + include/exclude list. 
> So, Hive will perform bootstrap for a table which is newly included after a 
> rename.
>   c. Also, if the renamed table is excluded by the replication policy, then 
> the old table needs to be dropped at the target as well.
> 9. Only the initial bootstrap load expects the target DB to be empty; the 
> intermediate bootstraps on tables due to regex or inclusion/exclusion list 
> changes or renames don’t expect the target DB or table to be empty. If any 
> table with the same name exists during such a bootstrap, the table will be 
> overwritten, including its data.
> {code}
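The three-part policy format from Point-4 can be illustrated with a small parser. This is only a sketch in Python, not Hive's implementation: the names ReplPolicy and parse_policy are hypothetical, and it assumes that an empty include list stands for "all tables" and that only the four forms listed in Point-4 are valid.

```python
import re
from typing import List, NamedTuple, Optional


class ReplPolicy(NamedTuple):
    """Hypothetical container for a parsed replication policy."""
    db_name: str
    include: List[str]  # empty list is assumed to mean "all tables"
    exclude: List[str]


# Accepts: <db>, <db>.*, <db>.[include], <db>.[include].[exclude]
_POLICY_RE = re.compile(
    r"^(?P<db>[^.\[\]\s]+)"               # part 1: DB name
    r"(?:\.\*"                            # "<db>.*" (full DB replication)
    r"|\.\[(?P<inc>[^\]]*)\]"             # part 2: include list in []
    r"(?:\.\[(?P<exc>[^\]]*)\])?)?$"      # part 3: optional exclude list in []
)


def _split_list(items: Optional[str]) -> List[str]:
    """Split a comma-separated list, dropping blanks and surrounding spaces."""
    if not items:
        return []
    return [item.strip() for item in items.split(",") if item.strip()]


def parse_policy(policy: str) -> ReplPolicy:
    """Parse a policy string into DB name, include list and exclude list."""
    match = _POLICY_RE.match(policy.strip())
    if match is None:
        raise ValueError(f"unsupported replication policy: {policy!r}")
    return ReplPolicy(
        match.group("db"),
        _split_list(match.group("inc")),
        _split_list(match.group("exc")),
    )
```

For example, parse_policy("db.[t1*, t2].[t100]") yields the DB name "db", the include list ["t1*", "t2"] and the exclude list ["t100"], while "db" and "db.*" both yield empty lists.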

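Points 3 and 8 describe how include/exclude lists select tables and how newly included tables are detected for bootstrap. The Python sketch below shows one way this comparison could work. All function names and the returned action strings are hypothetical, "*" is assumed to be a wildcard matching any run of characters, and an empty include list is assumed to mean full DB replication. The handling of renames where the table stays in scope, or was never in scope, is not specified in the description and is an assumption here.

```python
import re
from typing import Iterable, List, Sequence, Set, Tuple

Policy = Tuple[Sequence[str], Sequence[str]]  # (include list, exclude list)


def _matches_any(table: str, patterns: Iterable[str]) -> bool:
    """True if the table name matches any pattern ('*' treated as wildcard)."""
    for pattern in patterns:
        regex = ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.fullmatch(regex, table):
            return True
    return False


def tables_in_scope(tables: Iterable[str], include: Sequence[str],
                    exclude: Sequence[str]) -> Set[str]:
    """Tables selected by the include list and not removed by the exclude list."""
    return {
        t for t in tables
        if (not include or _matches_any(t, include))
        and not _matches_any(t, exclude)
    }


def tables_to_bootstrap(tables: Iterable[str], previous: Policy,
                        current: Policy) -> List[str]:
    """Tables in scope under the current policy but not under the previous one."""
    tables = list(tables)
    before = tables_in_scope(tables, *previous)
    after = tables_in_scope(tables, *current)
    return sorted(after - before)


def on_rename(old_name: str, new_name: str, include: Sequence[str],
              exclude: Sequence[str]) -> str:
    """Decide what a rename means for replication (Point-8 b and c)."""
    was_in = bool(tables_in_scope([old_name], include, exclude))
    now_in = bool(tables_in_scope([new_name], include, exclude))
    if now_in and not was_in:
        return "bootstrap_new_table"        # Point-8 b
    if was_in and not now_in:
        return "drop_old_table_at_target"   # Point-8 c
    # Assumption: the remaining cases are not covered by the description.
    return "replicate_rename" if now_in else "ignore"


tables = ["t1", "t10", "t100", "t2", "t3"]
# Policy changes from [t1*] to [t1*, t2].[t100]: only t2 is newly included.
print(tables_to_bootstrap(tables, (["t1*"], []), (["t1*", "t2"], ["t100"])))
# prints ['t2']
```

Keeping the comparison as a plain set difference means tables that were already in scope are left to the incremental dump, which matches the intent of combining bootstrap for added tables with the incremental dump.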


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)