[ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan resolved HIVE-21761.
-------------------------------------
    Resolution: Fixed

All patches committed to master.

> Support table level replication in Hive
> ---------------------------------------
>
>                 Key: HIVE-21761
>                 URL: https://issues.apache.org/jira/browse/HIVE-21761
>             Project: Hive
>          Issue Type: New Feature
>          Components: repl
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>            Priority: Major
>              Labels: DR, Replication
>
> *Requirements:*
> {code:java}
> - User needs to define replication policy to replicate any specific table. 
> This enables user to replicate only the business critical tables instead of 
> replicating all tables which may throttle the network bandwidth, storage and 
> also slow-down Hive replication.
> - User needs to define replication policy using regular expressions (such as 
> db.sales_*) and needs to include additional tables which are non-matching 
> given pattern and exclude some tables which are matching given pattern.
> - User needs to dynamically add/remove tables to the list either by manually 
> changing the replication policy during run time.
> {code}
> *Design:*
> {code:java}
> 1. Hive continue to support DB level replication policy of format <db_name> 
> but logically, we support the policy as <db_name>.'t1|t3| …'.'t*'.
> 2. Regular expression can also be supported as replication policy. For 
> example,
>   a. <db_name>.'<prefix*>'
>   b. <db_name>.'<*suffix>'
>   c. <db_name>.'<prefix*suffix>'
>   d. <db_name>.'<regex>' 
> 3. User can provide include and exclude list to specify the tables to be 
> included in the replication policy.
>   a. Include list specifies the tables to be included.
>   b. Exclude list specifies the tables to be excluded even if it satisfies 
> the expression in include list.
>   c. So the tables included in the policy is a-b.
>   d. For backward compatibility, if no include or exclude list is given, then 
> all the tables will be included in  
>      the policy.
> 4. New format for the Replication policy have 3 parts all separated with Dot 
> (.).
>   a. First part is DB name.
>   b. Second part is included list. Valid java regex within single quote.
>   c. Third part is excluded list. Valid java regex within single quote.
>     - <db_name> -- Full DB replication which is currently supported
>     - <db_name>.'.*?'  -- Full DB replication
>     - <db_name>.'t1|t3'  -- DB replication with static list of tables t1 and 
> t3 included.
>     - <db_name>.'(t1*)|t2'.'t100' -- DB replication with all tables having 
> prefix t1 and also include table t2 which doesn’t have prefix t1 and exclude 
> t100 which has the prefix t1.
> 5. If the DB property “repl.source.for” is set, then by default all the 
> tables in the DB will be enabled for replication and will continue to archive 
> deleted data to CM path.
> 6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
>   a. REPL DUMP <current_repl_policy> [REPLACE <previous_repl_policy> FROM 
> <last_repl_id> WITH <key_values_list>;
> current_repl_policy and previous_repl_policy can be any format mentioned in 
> Point-4.
>   b. REPLACE clause to be supported to take previous repl policy as input. 
>   c. Rest of the format remains same.
> 7. Now, REPL DUMP on this DB will replicate the tables based on 
> current_repl_policy.
> 8. Single table replication of format <db_name>.t1 is not supported. User can 
> provide the same with <db_name>.'t1' format.
> 9. If any table is added dynamically either due to change in regular 
> expression or added to include list should be bootstrapped. 
>   a. Hive will automatically figure out the list of tables newly included in 
> the list by comparing the current_repl_policy & previous_repl_policy inputs 
> and combine bootstrap dump for added tables as part of incremental dump. As 
> we can combine first incremental with bootstrap dump, it removes the current 
> limitation of target DB being inconsistent after bootstrap unless we run 
> first incremental replication.
>   b. If any table is renamed, then it may gets dynamically added/removed for 
> replication based on defined replication policy + include/exclude list. So, 
> Hive will perform bootstrap for the table which is just included after 
> rename. 
>   c. Also, if renamed table is excluded from replication policy, then need to 
> drop the old table at target as well.
> 10. Only the initial bootstrap load expects the target DB to be empty but the 
> intermediate bootstrap on tables due to regex or inclusion/exclusion list 
> change or renames doesn’t expect the target DB or table to be empty. If any 
> table with same name exist during such bootstrap, the table will be 
> overwritten including data.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to