[ https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on HIVE-21761 started by Sankar Hariappan.
-----------------------------------------------
> Support table level replication in Hive
> ---------------------------------------
>
>                 Key: HIVE-21761
>                 URL: https://issues.apache.org/jira/browse/HIVE-21761
>             Project: Hive
>          Issue Type: New Feature
>          Components: repl
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>            Priority: Major
>              Labels: DR, Replication
>
> *Requirements:*
> {code}
> - User needs to define a replication policy to replicate any specific table.
>   This enables the user to replicate only the business-critical tables instead
>   of replicating all tables, which may strain network bandwidth and storage
>   and also slow down Hive replication.
> - User needs to define a replication policy using regular expressions (such as
>   db.sales_*), to include additional tables that do not match the given
>   pattern, and to exclude some tables that do match it.
> - User needs to dynamically add/remove tables to/from the list by manually
>   changing the replication policy at run time.
> {code}
> *Design:*
> {code}
> 1. Hive continues to support the DB level replication policy of format
>    <db_name>.* but logically, we support the policy as <db_name>.(t1, t3, …).
> 2. A regular expression can also be supported as the replication policy. For
>    example,
>    a. <db_name>.<prefix*>,
>    b. <db_name>.<*suffix>,
>    c. <db_name>.<prefix*suffix>.
> 3. If a regular expression is provided as the replication policy, then Hive
>    also accepts include and exclude lists as input, which helps to dynamically
>    add/remove tables for replication.
>    a. Exclude list specifies the tables to be excluded even if they satisfy
>       the regular expression.
>    b. Include list specifies the tables to be included in addition to the
>       tables satisfying the regular expression.
> 4. The new format for the replication policy has 3 parts, all separated by a
>    dot (.).
>    a. First part is DB name.
>    b. Second part is included list.
>       Comma-separated table names/regex within square brackets [].
>    c. Third part is excluded list. Comma-separated table names/regex within
>       square brackets [].
>    - <db_name> -- Full DB replication
>    - <db_name>.* -- Full DB replication
>    - <db_name>.[t1, t3] -- DB replication with static list of tables t1 and
>      t3 included.
>    - <db_name>.[t1*, t2].[t100] -- DB replication with all tables having
>      prefix t1, also including table t2, which doesn’t have prefix t1, and
>      excluding t100, which has the prefix t1.
> 5. If the DB property “repl.source.for” is set, then by default all the
>    tables in the DB will be enabled for replication and will continue to
>    archive deleted data to the CM path.
> 6. REPL DUMP takes 2 inputs along with the existing FROM and WITH clauses.
>    a. REPL DUMP <current_repl_policy> [REPLACE <previous_repl_policy>] FROM
>       <last_repl_id> WITH <key_values_list>;
>       current_repl_policy and previous_repl_policy can be in any format
>       mentioned in Point-4.
>    b. REPLACE clause to be supported to take the previous repl policy as input.
>    c. Rest of the format remains the same.
> 7. Now, REPL DUMP on this DB will replicate the tables based on
>    current_repl_policy.
> 8. Any table that is added dynamically, either due to a change in the regular
>    expression or by being added to the include list, should be bootstrapped.
>    a. Hive will automatically figure out the list of tables newly included in
>       the list by comparing the current_repl_policy & previous_repl_policy
>       inputs and combine the bootstrap dump for added tables as part of the
>       incremental dump. As we can combine the first incremental with the
>       bootstrap dump, it removes the current limitation of the target DB being
>       inconsistent after bootstrap unless we run the first incremental
>       replication.
>    b. If any table is renamed, then it may get dynamically added/removed for
>       replication based on the defined replication policy + include/exclude
>       list. So, Hive will perform bootstrap for a table which is newly
>       included after a rename.
>    c. Also, if the renamed table is excluded from the replication policy, then
>       the old table needs to be dropped at the target as well.
> 9. Only the initial bootstrap load expects the target DB to be empty, but the
>    intermediate bootstrap on tables due to regex or inclusion/exclusion list
>    change or renames doesn’t expect the target DB or table to be empty. If any
>    table with the same name exists during such a bootstrap, the table will be
>    overwritten, including data.
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
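
To make the policy format in Point-4 and the policy comparison in Point-8a concrete, here is a minimal Python sketch. It is not Hive's implementation. The function names (parse_policy, is_replicated, tables_to_bootstrap) are hypothetical, and it assumes that `*` is the only wildcard in a pattern and matches any run of characters, as the prefix/suffix examples in Point-2 suggest.

```python
import re

# Assumed grammar, taken from the examples in Point-4:
#   <db> | <db>.* | <db>.[include, ...] | <db>.[include, ...].[exclude, ...]
_POLICY = re.compile(r"(\w+)(?:\.\*|\.\[([^\]]*)\](?:\.\[([^\]]*)\])?)?")


def parse_policy(policy):
    """Return (db_name, include_patterns, exclude_patterns) for a policy string."""
    m = _POLICY.fullmatch(policy.strip())
    if m is None:
        raise ValueError(f"unrecognised replication policy: {policy!r}")
    db, include, exclude = m.groups()

    def items(part):
        return [p.strip() for p in part.split(",") if p.strip()]

    # '<db>' and '<db>.*' both mean full DB replication.
    include_patterns = ["*"] if include is None else items(include)
    exclude_patterns = [] if exclude is None else items(exclude)
    return db, include_patterns, exclude_patterns


def _matches(table, pattern):
    # Assumption: '*' is the only wildcard; every other character is literal.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, table) is not None


def is_replicated(table, include_patterns, exclude_patterns):
    """A table is replicated if it matches the include list and not the exclude list."""
    included = any(_matches(table, p) for p in include_patterns)
    excluded = any(_matches(table, p) for p in exclude_patterns)
    return included and not excluded


def tables_to_bootstrap(tables, current_policy, previous_policy):
    """Tables covered by the current policy but not by the previous one (Point-8a)."""
    _, cur_inc, cur_exc = parse_policy(current_policy)
    _, prev_inc, prev_exc = parse_policy(previous_policy)
    return sorted(
        t for t in tables
        if is_replicated(t, cur_inc, cur_exc)
        and not is_replicated(t, prev_inc, prev_exc)
    )
```

For example, with tables t1, t10, t100, t2 and t3 in a DB named sales (a made-up name), the policy sales.[t1*, t2].[t100] selects t1, t10 and t2. If the previous policy was sales.[t1, t3], the sketch reports t10 and t2 as the tables that would need a bootstrap in the next incremental dump.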