[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-10-15 Thread Sankar Hariappan (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Fix Version/s: 4.0.0

> Support table level replication in Hive
> ---
>
> Key: HIVE-21761
> URL: https://issues.apache.org/jira/browse/HIVE-21761
> Project: Hive
>  Issue Type: New Feature
>  Components: repl
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>Priority: Major
>  Labels: DR, Replication
> Fix For: 4.0.0
>
>
> *Requirements:*
> {code:java}
> - Users need to define a replication policy that replicates specific tables. 
> This lets them replicate only the business-critical tables instead of 
> replicating all tables, which may strain network bandwidth and storage and 
> also slow down Hive replication.
> - Users need to define a replication policy using regular expressions (such as 
> db.sales_*), include additional tables that do not match the given pattern, 
> and exclude some tables that do match it.
> - Users need to dynamically add tables to or remove tables from the list by 
> manually changing the replication policy at run time.
> {code}
> *Design:*
> {code:java}
> 1. Hive continues to support the DB-level replication policy of format 
> <db_name>, but logically we support the policy as <db_name>.'t1|t3| …'.'t*'.
> 2. Regular expressions can also be used as the replication policy. For 
> example,
>   a. .''
>   b. .'<*suffix>'
>   c. .''
>   d. .'' 
> 3. Users can provide include and exclude lists to specify the tables covered 
> by the replication policy.
>   a. The include list specifies the tables to be included.
>   b. The exclude list specifies the tables to be excluded even if they match 
> the expression in the include list.
>   c. So the tables included in the policy are (a) minus (b).
>   d. For backward compatibility, if no include or exclude list is given, then 
> all the tables will be included in the policy.
> 4. The new format for the replication policy has 3 parts, all separated by a 
> dot (.).
>   a. The first part is the DB name.
>   b. The second part is the include list: a valid Java regex within single 
> quotes.
>   c. The third part is the exclude list: a valid Java regex within single 
> quotes.
> - <db_name> -- Full DB replication, which is currently supported.
> - <db_name>.'.*?' -- Full DB replication.
> - <db_name>.'t1|t3' -- DB replication with a static list of tables, t1 and 
> t3, included.
> - <db_name>.'(t1*)|t2'.'t100' -- DB replication with all tables having 
> prefix t1, plus table t2, which doesn’t have prefix t1, and excluding 
> t100, which has the prefix t1.
> 5. If the DB property “repl.source.for” is set, then by default all the 
> tables in the DB will be enabled for replication and will continue to archive 
> deleted data to CM path.
> 6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
>   a. REPL DUMP  [REPLACE  FROM 
>  WITH ;
> current_repl_policy and previous_repl_policy can be in any format mentioned 
> in Point-4.
>   b. A REPLACE clause is to be supported that takes the previous repl policy 
> as input. 
>   c. The rest of the format remains the same.
> 7. Now, REPL DUMP on this DB will replicate the tables based on 
> current_repl_policy.
> 8. Single table replication of format <db_name>.t1 is not supported. Users 
> can express the same policy in the <db_name>.'t1' format.
> 9. Any table that is added dynamically, either through a change in the 
> regular expression or by being added to the include list, should be 
> bootstrapped. 
>   a. Hive will automatically figure out the list of newly included tables by 
> comparing the current_repl_policy and previous_repl_policy inputs and will 
> combine the bootstrap dump for the added tables with the incremental dump. 
> Because the first incremental dump can be combined with the bootstrap dump, 
> this removes the current limitation that the target DB is inconsistent after 
> bootstrap until the first incremental replication runs.
>   b. If a table is renamed, it may be dynamically added to or removed from 
> replication based on the defined replication policy and include/exclude 
> list. So Hive will bootstrap a table that becomes included only after the 
> rename. 
>   c. Also, if the renamed table is excluded from the replication policy, then 
> the old table needs to be dropped at the target as well.
> 10. Only the initial bootstrap load expects the target DB to be empty; the 
> intermediate bootstrap of tables due to a regex or inclusion/exclusion list 
> change or a rename doesn’t expect the target DB or table to be empty. If a 
> table with the same name exists during such a bootstrap, the table will be 
> overwritten, including its data.
> {code}
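
As an illustration of points 3, 4 and 8 above, here is a minimal Python sketch of how a policy string of the form db.'include'.'exclude' could be split and applied to table names. It is not Hive's implementation: the names parse_policy and is_replicated are hypothetical, Python's re module stands in for Java regex, and whole-name (full-match) semantics are an assumption. Under that assumption the pattern t1* matches only t, t1, t11 and so on, so the sketch writes the prefix case as t1.* instead.

```python
import re

def parse_policy(policy):
    """Split "db.'include'.'exclude'" into (db, include_regex, exclude_regex).

    Missing parts come back as None; a bare "db" means full-DB replication.
    The unquoted single-table form "db.t1" is rejected, as in point 8.
    """
    m = re.fullmatch(r"(\w+)(?:\.'([^']*)')?(?:\.'([^']*)')?", policy)
    if m is None:
        raise ValueError(f"unsupported replication policy: {policy!r}")
    return m.group(1), m.group(2), m.group(3)

def is_replicated(table, include, exclude):
    """A table is in scope if it matches include and does not match exclude."""
    if include is not None and re.fullmatch(include, table) is None:
        return False
    if exclude is not None and re.fullmatch(exclude, table) is not None:
        return False
    return True

# Hypothetical database and table names, for illustration only.
db, include, exclude = parse_policy("sales.'(t1.*)|t2'.'t100'")
tables = ["t1", "t15", "t100", "t2", "t3"]
in_scope = [t for t in tables if is_replicated(t, include, exclude)]
# in_scope == ["t1", "t15", "t2"]
```

With no include or exclude part, parse_policy returns None for both and every table is in scope, which mirrors the backward-compatibility rule in point 3d.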



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-07-08 Thread mahesh kumar behera (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-21761:
---
Description: 
*Requirements:*
{code:java}
- Users need to define a replication policy that replicates specific tables. This 
lets them replicate only the business-critical tables instead of replicating all 
tables, which may strain network bandwidth and storage and also slow down Hive 
replication.
- Users need to define a replication policy using regular expressions (such as 
db.sales_*), include additional tables that do not match the given pattern, and 
exclude some tables that do match it.
- Users need to dynamically add tables to or remove tables from the list by 
manually changing the replication policy at run time.
{code}
*Design:*
{code:java}
1. Hive continues to support the DB-level replication policy of format <db_name>, 
but logically we support the policy as <db_name>.'t1|t3| …'.'t*'.

2. Regular expressions can also be used as the replication policy. For example,
  a. .''
  b. .'<*suffix>'
  c. .''
  d. .'' 

3. Users can provide include and exclude lists to specify the tables covered by 
the replication policy.
  a. The include list specifies the tables to be included.
  b. The exclude list specifies the tables to be excluded even if they match the 
expression in the include list.
  c. So the tables included in the policy are (a) minus (b).
  d. For backward compatibility, if no include or exclude list is given, then 
all the tables will be included in the policy.

4. The new format for the replication policy has 3 parts, all separated by a dot 
(.).
  a. The first part is the DB name.
  b. The second part is the include list: a valid Java regex within single quotes.
  c. The third part is the exclude list: a valid Java regex within single quotes.
- <db_name> -- Full DB replication, which is currently supported.
- <db_name>.'.*?' -- Full DB replication.
- <db_name>.'t1|t3' -- DB replication with a static list of tables, t1 and t3, 
included.
- <db_name>.'(t1*)|t2'.'t100' -- DB replication with all tables having 
prefix t1, plus table t2, which doesn’t have prefix t1, and excluding 
t100, which has the prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be in any format mentioned in 
Point-4.
  b. A REPLACE clause is to be supported that takes the previous repl policy as 
input. 
  c. The rest of the format remains the same.

7. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.

8. Single table replication of format <db_name>.t1 is not supported. Users can 
express the same policy in the <db_name>.'t1' format.

9. Any table that is added dynamically, either through a change in the regular 
expression or by being added to the include list, should be bootstrapped. 
  a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy and previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. 
Because the first incremental dump can be combined with the bootstrap dump, this 
removes the current limitation that the target DB is inconsistent after 
bootstrap until the first incremental replication runs.
  b. If a table is renamed, it may be dynamically added to or removed from 
replication based on the defined replication policy and include/exclude list. So 
Hive will bootstrap a table that becomes included only after the rename. 
  c. Also, if the renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.

10. Only the initial bootstrap load expects the target DB to be empty; the 
intermediate bootstrap of tables due to a regex or inclusion/exclusion list 
change or a rename doesn’t expect the target DB or table to be empty. If a 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.
{code}
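
Point 9a says Hive compares current_repl_policy with previous_repl_policy to find newly included tables. A minimal sketch of that comparison follows, in Python with hypothetical helper names (in_scope, tables_to_bootstrap) and assumed full-match regex semantics. Each policy is reduced to an (include, exclude) pair in which None means the part is absent. This is an illustration of the idea, not Hive's code.

```python
import re

def in_scope(table, include=None, exclude=None):
    """True if the table matches include (or include is absent) and not exclude."""
    if include is not None and re.fullmatch(include, table) is None:
        return False
    if exclude is not None and re.fullmatch(exclude, table) is not None:
        return False
    return True

def tables_to_bootstrap(tables, current, previous):
    """Tables covered by the current policy that the previous policy did not cover."""
    return [t for t in tables
            if in_scope(t, *current) and not in_scope(t, *previous)]

# Hypothetical example: t2 is added to the include list between two dumps.
source_tables = ["t1", "t2", "t3", "t100"]
previous_policy = ("t1|t3", None)
current_policy = ("t1|t2|t3", "t100")
newly_included = tables_to_bootstrap(source_tables, current_policy, previous_policy)
# newly_included == ["t2"]
```

Only the tables in newly_included would need a bootstrap dump; the rest continue with incremental replication.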

  was:
*Requirements:*
{code}
- User needs to define replication policy to replicate any specific table. This 
enables user to replicate only the business critical tables instead of 
replicating all tables which may throttle the network bandwidth, storage and 
also slow-down Hive replication.
- User needs to define replication policy using regular expressions (such as 
db.sales_*) and needs to include additional tables which are non-matching given 
pattern and exclude some tables which are matching given pattern.
- User needs to dynamically add/remove tables to the list either by manually 
changing the replication policy during run time.
{code}

*Design:*
{code}
1. Hive continue to support DB level replication policy of format .* 
but logically, we support the policy as .(t1, t3, …).

2. Regular expression can also be supported as replication 

[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-06-15 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Description: 
*Requirements:*
{code}
- Users need to define a replication policy that replicates specific tables. This 
lets them replicate only the business-critical tables instead of replicating all 
tables, which may strain network bandwidth and storage and also slow down Hive 
replication.
- Users need to define a replication policy using regular expressions (such as 
db.sales_*), include additional tables that do not match the given pattern, and 
exclude some tables that do match it.
- Users need to dynamically add tables to or remove tables from the list by 
manually changing the replication policy at run time.
{code}

*Design:*
{code}
1. Hive continues to support the DB-level replication policy of format 
<db_name>.*, but logically we support the policy as <db_name>.(t1, t3, …).

2. Regular expressions can also be used as the replication policy. For example,
  a. .[], 
  b. .[<*suffix>], 
  c. .[]. 

3. If a regular expression is provided as the replication policy, then Hive also 
accepts include and exclude lists as input, which also helps to dynamically 
add or remove tables for replication.
  a. The exclude list specifies the tables to be excluded even if they satisfy 
the regular expression. 
  b. The include list specifies the tables to be included in addition to the 
tables satisfying the regular expression. 

4. The new format for the replication policy has 3 parts, all separated by a dot 
(.).
  a. The first part is the DB name.
  b. The second part is the include list: comma-separated table names/regexes 
within square brackets []. If the square brackets are absent, it is treated as 
single table replication, which skips DB-level events.
  c. The third part is the exclude list: comma-separated table names/regexes 
within square brackets [].
- <db_name> -- Full DB replication, which is currently supported.
- <db_name>.['.*?'] -- Full DB replication.
- <db_name>.[] -- Replicates just functions and does not include any tables.
- <db_name>.['t1', 't3'] -- DB replication with a static list of tables, t1 
and t3, included.
- <db_name>.['t1*', 't2'].['t100'] -- DB replication with all tables having 
prefix t1, plus table t2, which doesn’t have prefix t1, and excluding 
t100, which has the prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be any format mentioned in 
Point-4.
  b. REPLACE clause to be supported to take previous repl policy as input. 
  c. Rest of the format remains same.

7. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.

8. Any table that is added dynamically, either through a change in the regular 
expression or by being added to the include list, should be bootstrapped. 
  a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy and previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. 
Because the first incremental dump can be combined with the bootstrap dump, this 
removes the current limitation that the target DB is inconsistent after 
bootstrap until the first incremental replication runs.
  b. If a table is renamed, it may be dynamically added to or removed from 
replication based on the defined replication policy and include/exclude list. So 
Hive will bootstrap a table that becomes included only after the rename. 
  c. Also, if the renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.

9. Only the initial bootstrap load expects the target DB to be empty; the 
intermediate bootstrap of tables due to a regex or inclusion/exclusion list 
change or a rename doesn’t expect the target DB or table to be empty. If a 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.
{code}
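
The rename cases in points 8b and 8c can be sketched as a small decision function. This is an illustration only: rename_action and in_scope are hypothetical names, the include and exclude lists are Python lists of regexes matched against the whole table name (an assumption), and the "rename" and "skip" outcomes are not spelled out in the description above. They are filled in here as the natural remaining cases.

```python
import re

def in_scope(table, include, exclude):
    """True if the table matches some include pattern and no exclude pattern."""
    included = any(re.fullmatch(p, table) for p in include)
    excluded = any(re.fullmatch(p, table) for p in exclude)
    return included and not excluded

def rename_action(old_name, new_name, include, exclude):
    """Decide what a rename from old_name to new_name means for the target."""
    was_in = in_scope(old_name, include, exclude)
    now_in = in_scope(new_name, include, exclude)
    if now_in and not was_in:
        return "bootstrap"  # 8b: table becomes included only after the rename
    if was_in and not now_in:
        return "drop"       # 8c: renamed out of the policy, drop old table at target
    if was_in and now_in:
        return "rename"     # assumption: replicate as an ordinary rename
    return "skip"           # assumption: never in scope, nothing to do

# Hypothetical policy: prefix t1 plus t2 included, t100 excluded.
include_list, exclude_list = ["t1.*", "t2"], ["t100"]
action = rename_action("tmp", "t2", include_list, exclude_list)
# action == "bootstrap"
```

An empty include list matches nothing in this sketch, which lines up with the `[]` case above that replicates functions but no tables.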


  was:
*Requirements:*
{code}
- User needs to define replication policy to replicate any specific table. This 
enables user to replicate only the business critical tables instead of 
replicating all tables which may throttle the network bandwidth, storage and 
also slow-down Hive replication.
- User needs to define replication policy using regular expressions (such as 
db.sales_*) and needs to include additional tables which are non-matching given 
pattern and exclude some tables which are matching given pattern.
- User needs to dynamically add/remove tables to the list either by manually 
changing the replication policy during run time.
{code}

*Design:*
{code}
1. Hive continue to support DB level replication policy of format .* 
but logically, we support the policy as .(t1, t3, …).

2. 

[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-06-10 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Description: 
*Requirements:*
{code}
- Users need to define a replication policy that replicates specific tables. This 
lets them replicate only the business-critical tables instead of replicating all 
tables, which may strain network bandwidth and storage and also slow down Hive 
replication.
- Users need to define a replication policy using regular expressions (such as 
db.sales_*), include additional tables that do not match the given pattern, and 
exclude some tables that do match it.
- Users need to dynamically add tables to or remove tables from the list by 
manually changing the replication policy at run time.
{code}

*Design:*
{code}
1. Hive continues to support the DB-level replication policy of format 
<db_name>.*, but logically we support the policy as <db_name>.(t1, t3, …).

2. Regular expressions can also be used as the replication policy. For example,
  a. .[], 
  b. .[<*suffix>], 
  c. .[]. 

3. If a regular expression is provided as the replication policy, then Hive also 
accepts include and exclude lists as input, which also helps to dynamically 
add or remove tables for replication.
  a. The exclude list specifies the tables to be excluded even if they satisfy 
the regular expression. 
  b. The include list specifies the tables to be included in addition to the 
tables satisfying the regular expression. 

4. The new format for the replication policy has 3 parts, all separated by a dot 
(.).
  a. The first part is the DB name.
  b. The second part is the include list: comma-separated table names/regexes 
within square brackets []. If the square brackets are absent, it is treated as 
single table replication, which skips DB-level events.
  c. The third part is the exclude list: comma-separated table names/regexes 
within square brackets [].
- <db_name> -- Full DB replication, which is currently supported.
- <db_name>.['.*?'] -- Full DB replication.
- <db_name>.[] -- Replicates just functions and does not include any tables.
- <db_name>.['t1', 't3'] -- DB replication with a static list of tables, t1 
and t3, included.
- <db_name>.['t1*', 't2'].['t100'] -- DB replication with all tables having 
prefix t1, plus table t2, which doesn’t have prefix t1, and excluding 
t100, which has the prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be any format mentioned in 
Point-4.
  b. REPLACE clause to be supported to take previous repl policy as input. 
  c. Rest of the format remains same.

7. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.

8. Single table replication of format <db_name>.t1 doesn’t allow changing the 
policy dynamically. So the REPLACE clause is not allowed if previous_repl_policy 
is of this format.

9. Any table that is added dynamically, either through a change in the regular 
expression or by being added to the include list, should be bootstrapped. 
  a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy and previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. 
Because the first incremental dump can be combined with the bootstrap dump, this 
removes the current limitation that the target DB is inconsistent after 
bootstrap until the first incremental replication runs.
  b. If a table is renamed, it may be dynamically added to or removed from 
replication based on the defined replication policy and include/exclude list. So 
Hive will bootstrap a table that becomes included only after the rename. 
  c. Also, if the renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.

10. Only the initial bootstrap load expects the target DB to be empty; the 
intermediate bootstrap of tables due to a regex or inclusion/exclusion list 
change or a rename doesn’t expect the target DB or table to be empty. If a 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.
{code}


  was:
*Requirements:*
{code}
- User needs to define replication policy to replicate any specific table. This 
enables user to replicate only the business critical tables instead of 
replicating all tables which may throttle the network bandwidth, storage and 
also slow-down Hive replication.
- User needs to define replication policy using regular expressions (such as 
db.sales_*) and needs to include additional tables which are non-matching given 
pattern and exclude some tables which are matching given pattern.
- User needs to dynamically add/remove tables to the list either by manually 
changing the replication policy 

[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-06-07 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Description: 
*Requirements:*
{code}
- Users need to define a replication policy that replicates specific tables. This 
lets them replicate only the business-critical tables instead of replicating all 
tables, which may strain network bandwidth and storage and also slow down Hive 
replication.
- Users need to define a replication policy using regular expressions (such as 
db.sales_*), include additional tables that do not match the given pattern, and 
exclude some tables that do match it.
- Users need to dynamically add tables to or remove tables from the list by 
manually changing the replication policy at run time.
{code}

*Design:*
{code}
1. Hive continues to support the DB-level replication policy of format 
<db_name>.*, but logically we support the policy as <db_name>.(t1, t3, …).

2. Regular expressions can also be used as the replication policy. For example,
  a. .[], 
  b. .[<*suffix>], 
  c. .[]. 

3. If a regular expression is provided as the replication policy, then Hive also 
accepts include and exclude lists as input, which also helps to dynamically 
add or remove tables for replication.
  a. The exclude list specifies the tables to be excluded even if they satisfy 
the regular expression. 
  b. The include list specifies the tables to be included in addition to the 
tables satisfying the regular expression. 

4. The new format for the replication policy has 3 parts, all separated by a dot 
(.).
  a. The first part is the DB name.
  b. The second part is the include list: comma-separated table names/regexes 
within square brackets []. If the square brackets are absent, it is treated as 
single table replication, which skips DB-level events.
  c. The third part is the exclude list: comma-separated table names/regexes 
within square brackets [].
- <db_name> -- Full DB replication, which is currently supported.
- <db_name>.[] -- Full DB replication.
- <db_name>.['.*?'] -- Full DB replication.
- <db_name>.t1 -- Single table replication (DB events excluded), which is 
currently supported.
- <db_name>.['t1', 't3'] -- DB replication with a static list of tables, t1 
and t3, included.
- <db_name>.['t1*', 't2'].['t100'] -- DB replication with all tables having 
prefix t1, plus table t2, which doesn’t have prefix t1, and excluding 
t100, which has the prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be any format mentioned in 
Point-4.
  b. REPLACE clause to be supported to take previous repl policy as input. 
  c. Rest of the format remains same.

7. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.

8. Single table replication of format <db_name>.t1 doesn’t allow changing the 
policy dynamically. So the REPLACE clause is not allowed if previous_repl_policy 
is of this format.

9. Any table that is added dynamically, either through a change in the regular 
expression or by being added to the include list, should be bootstrapped. 
  a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy and previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. 
Because the first incremental dump can be combined with the bootstrap dump, this 
removes the current limitation that the target DB is inconsistent after 
bootstrap until the first incremental replication runs.
  b. If a table is renamed, it may be dynamically added to or removed from 
replication based on the defined replication policy and include/exclude list. So 
Hive will bootstrap a table that becomes included only after the rename. 
  c. Also, if the renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.

10. Only the initial bootstrap load expects the target DB to be empty; the 
intermediate bootstrap of tables due to a regex or inclusion/exclusion list 
change or a rename doesn’t expect the target DB or table to be empty. If a 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.
{code}


  was:
*Requirements:*
{code}
- User needs to define replication policy to replicate any specific table. This 
enables user to replicate only the business critical tables instead of 
replicating all tables which may throttle the network bandwidth, storage and 
also slow-down Hive replication.
- User needs to define replication policy using regular expressions (such as 
db.sales_*) and needs to include additional tables which are non-matching given 
pattern and exclude some tables which are matching given pattern.
- User needs to dynamically add/remove tables to the list 

[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-06-05 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Description: 
*Requirements:*
{code}
- Users need to define a replication policy that replicates specific tables. This 
lets them replicate only the business-critical tables instead of replicating all 
tables, which may strain network bandwidth and storage and also slow down Hive 
replication.
- Users need to define a replication policy using regular expressions (such as 
db.sales_*), include additional tables that do not match the given pattern, and 
exclude some tables that do match it.
- Users need to dynamically add tables to or remove tables from the list by 
manually changing the replication policy at run time.
{code}

*Design:*
{code}
1. Hive continues to support the DB-level replication policy of format 
<db_name>.*, but logically we support the policy as <db_name>.(t1, t3, …).

2. Regular expressions can also be used as the replication policy. For example,
  a. .[], 
  b. .[<*suffix>], 
  c. .[]. 

3. If a regular expression is provided as the replication policy, then Hive also 
accepts include and exclude lists as input, which also helps to dynamically 
add or remove tables for replication.
  a. The exclude list specifies the tables to be excluded even if they satisfy 
the regular expression. 
  b. The include list specifies the tables to be included in addition to the 
tables satisfying the regular expression. 

4. The new format for the replication policy has 3 parts, all separated by a dot 
(.).
  a. The first part is the DB name.
  b. The second part is the include list: comma-separated table names/regexes 
within square brackets []. If the square brackets are absent, it is treated as 
single table replication, which skips DB-level events.
  c. The third part is the exclude list: comma-separated table names/regexes 
within square brackets [].
- <db_name> -- Full DB replication, which is currently supported.
- <db_name>.[] -- Full DB replication.
- <db_name>.['*'] -- Full DB replication.
- <db_name>.t1 -- Single table replication (DB events excluded), which is 
currently supported.
- <db_name>.['t1', 't3'] -- DB replication with a static list of tables, t1 
and t3, included.
- <db_name>.['t1*', 't2'].['t100'] -- DB replication with all tables having 
prefix t1, plus table t2, which doesn’t have prefix t1, and excluding 
t100, which has the prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be any format mentioned in 
Point-4.
  b. REPLACE clause to be supported to take previous repl policy as input. 
  c. Rest of the format remains same.

7. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.

8. Single table replication of format <db_name>.t1 doesn’t allow changing the 
policy dynamically. So the REPLACE clause is not allowed if previous_repl_policy 
is of this format.

9. Any table that is added dynamically, either through a change in the regular 
expression or by being added to the include list, should be bootstrapped. 
  a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy and previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. 
Because the first incremental dump can be combined with the bootstrap dump, this 
removes the current limitation that the target DB is inconsistent after 
bootstrap until the first incremental replication runs.
  b. If a table is renamed, it may be dynamically added to or removed from 
replication based on the defined replication policy and include/exclude list. So 
Hive will bootstrap a table that becomes included only after the rename. 
  c. Also, if the renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.

10. Only the initial bootstrap load expects the target DB to be empty; the 
intermediate bootstrap of tables due to a regex or inclusion/exclusion list 
change or a rename doesn’t expect the target DB or table to be empty. If a 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.
{code}


  was:
*Requirements:*
{code}
- User needs to define replication policy to replicate any specific table. This 
enables user to replicate only the business critical tables instead of 
replicating all tables which may throttle the network bandwidth, storage and 
also slow-down Hive replication.
- User needs to define replication policy using regular expressions (such as 
db.sales_*) and needs to include additional tables which are non-matching given 
pattern and exclude some tables which are matching given pattern.
- User needs to dynamically add/remove tables to the list 

[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-06-04 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Description: 
*Requirements:*
{code}
- Users need to define a replication policy that replicates specific tables. This 
lets them replicate only the business-critical tables instead of replicating all 
tables, which may strain network bandwidth and storage and also slow down Hive 
replication.
- Users need to define a replication policy using regular expressions (such as 
db.sales_*), include additional tables that do not match the given pattern, and 
exclude some tables that do match it.
- Users need to dynamically add tables to or remove tables from the list by 
manually changing the replication policy at run time.
{code}

*Design:*
{code}
1. Hive continues to support the DB-level replication policy of format 
<db_name>.*, but logically we support the policy as <db_name>.(t1, t3, …).

2. Regular expressions can also be used as the replication policy. For example,
  a. .[], 
  b. .[<*suffix>], 
  c. .[]. 

3. If a regular expression is provided as the replication policy, then Hive also 
accepts include and exclude lists as input, which also helps to dynamically 
add or remove tables for replication.
  a. The exclude list specifies the tables to be excluded even if they satisfy 
the regular expression. 
  b. The include list specifies the tables to be included in addition to the 
tables satisfying the regular expression. 

4. New format for the Replication policy have 3 parts all separated with Dot 
(.).
  a. First part is DB name.
  b. Second part is included list. Comma separated table names/regex with in 
square brackets[].  If square brackets are not there, then it is treated as 
single table replication which skips DB level events.
  c. Third part is excluded list. Comma separated table names/regex with in 
square brackets[].
-  -- Full DB replication which is currently supported
- .[*]  - Full DB replication
- .t1 -- Single table replication (DB events excluded) which is 
currently supported
- .[t1, t3]  -- DB replication with static list of tables t1 and 
t3 included.
- .[t1*, t2].[t100] -- DB replication of all tables having 
prefix t1, plus table t2 which doesn’t have prefix t1, but excluding 
t100 which has the prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be in any format mentioned in 
Point-4.
  b. A REPLACE clause is to be supported that takes the previous repl policy as input.
  c. The rest of the format remains the same.

7. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.

8. Single table replication of format .t1 doesn’t allow changing the 
policy dynamically. So the REPLACE clause is not allowed if previous_repl_policy 
is of this format.

9. Any table that is added dynamically, either due to a change in the regular 
expression or by being added to the include list, should be bootstrapped.
  a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy & previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. As we 
can combine the first incremental with the bootstrap dump, this removes the current 
limitation of the target DB being inconsistent after bootstrap unless we run the 
first incremental replication.
  b. If any table is renamed, then it may get dynamically added/removed for 
replication based on the defined replication policy + include/exclude list. So, 
Hive will perform bootstrap for a table which is newly included after a rename.
  c. Also, if a renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.

10. Only the initial bootstrap load expects the target DB to be empty, but the 
intermediate bootstrap on tables due to regex or inclusion/exclusion list 
changes or renames doesn’t expect the target DB or table to be empty. If any 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.
{code}
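
The policy format in point 4 and the include/exclude semantics in point 3 can be sketched in Python. This is only an illustration and not Hive's implementation: parse_policy and select_tables are hypothetical names, the patterns are assumed to be glob-style (such as t1*) and are matched with the standard fnmatch module, and matching is case-sensitive here.

```python
import re
from fnmatch import fnmatchcase

# Assumed grammar: db, db.t1, db.[include list], db.[include list].[exclude list]
_POLICY = re.compile(
    r"^(?P<db>\w+)"
    r"(?:\.(?:\[(?P<inc>[^\]]*)\]|(?P<single>\w+)))?"
    r"(?:\.\[(?P<exc>[^\]]*)\])?$"
)


def _split(items):
    """Split a comma-separated list, dropping surrounding whitespace."""
    return [s.strip() for s in items.split(",") if s.strip()]


def parse_policy(policy):
    """Return (db, include_patterns, exclude_patterns) for a policy string."""
    m = _POLICY.match(policy.strip())
    if m is None:
        raise ValueError(f"unrecognised policy: {policy!r}")
    if m["single"] is not None:
        include = [m["single"]]      # db.t1: single-table form
    elif m["inc"] is not None:
        include = _split(m["inc"])   # db.[...]: bracketed include list
    else:
        include = ["*"]              # bare db: every table
    exclude = _split(m["exc"]) if m["exc"] else []
    return m["db"], include, exclude


def select_tables(tables, policy):
    """Tables that match the include list and do not match the exclude list."""
    _, include, exclude = parse_policy(policy)
    return sorted(
        t for t in tables
        if any(fnmatchcase(t, p) for p in include)
        and not any(fnmatchcase(t, p) for p in exclude)
    )


tables = ["t1", "t10", "t100", "t2", "t3"]
print(select_tables(tables, "db.[t1*, t2].[t100]"))  # ['t1', 't10', 't2']
```

The exclude list is applied after the include list, so t100 is dropped even though it matches t1*.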



[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-06-04 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Description: 
*Requirements:*
{code}
- User needs to define a replication policy to replicate any specific table. This 
enables the user to replicate only the business-critical tables instead of 
replicating all tables, which may strain network bandwidth and storage and 
also slow down Hive replication.
- User needs to define a replication policy using regular expressions (such as 
db.sales_*), and needs to include additional tables which do not match the given 
pattern and exclude some tables which do match the given pattern.
- User needs to dynamically add/remove tables to/from the list by manually 
changing the replication policy at run time.
{code}

*Design:*
{code}
1. Hive continues to support the DB-level replication policy of format .* 
but logically, we support the policy as .(t1, t3, …).

2. A regular expression can also be used as the replication policy. For example,
  a. .[], 
  b. .[<*suffix>], 
  c. .[]. 

3. If a regular expression is provided as the replication policy, then Hive also 
accepts include and exclude lists as input, which also helps to dynamically 
add/remove tables for replication.
  a. The exclude list specifies the tables to be excluded even if they satisfy the 
regular expression.
  b. The include list specifies the tables to be included in addition to the tables 
satisfying the regular expression.

4. The new format for the replication policy has 3 parts, all separated by a dot 
(.).
  a. The first part is the DB name.
  b. The second part is the include list: comma-separated table names/regexes within 
square brackets [].
  c. The third part is the exclude list: comma-separated table names/regexes within 
square brackets [].
-  -- Full DB replication which is currently supported
- .[*]  - Full DB replication
- .t1 -- Single table replication (DB events excluded) which is 
currently supported
- .[t1, t3]  -- DB replication with static list of tables t1 and 
t3 included.
- .[t1*, t2].[t100] -- DB replication of all tables having 
prefix t1, plus table t2 which doesn’t have prefix t1, but excluding 
t100 which has the prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be in any format mentioned in 
Point-4.
  b. A REPLACE clause is to be supported that takes the previous repl policy as input.
  c. The rest of the format remains the same.

7. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.

8. Single table replication of format .t1 doesn’t allow changing the 
policy dynamically. So the REPLACE clause is not allowed if previous_repl_policy 
is of this format.

9. Any table that is added dynamically, either due to a change in the regular 
expression or by being added to the include list, should be bootstrapped.
  a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy & previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. As we 
can combine the first incremental with the bootstrap dump, this removes the current 
limitation of the target DB being inconsistent after bootstrap unless we run the 
first incremental replication.
  b. If any table is renamed, then it may get dynamically added/removed for 
replication based on the defined replication policy + include/exclude list. So, 
Hive will perform bootstrap for a table which is newly included after a rename.
  c. Also, if a renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.

10. Only the initial bootstrap load expects the target DB to be empty, but the 
intermediate bootstrap on tables due to regex or inclusion/exclusion list 
changes or renames doesn’t expect the target DB or table to be empty. If any 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.
{code}
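
The comparison of current_repl_policy and previous_repl_policy described in point 9a can be sketched as a set difference. This is a hedged illustration rather than Hive's code: in_policy and tables_to_bootstrap are hypothetical names, each policy is assumed to be already parsed into an (include patterns, exclude patterns) pair, and patterns are assumed to be glob-style.

```python
from fnmatch import fnmatchcase


def in_policy(table, include, exclude):
    """True if the table matches the include list and not the exclude list."""
    return (any(fnmatchcase(table, p) for p in include)
            and not any(fnmatchcase(table, p) for p in exclude))


def tables_to_bootstrap(tables, previous, current):
    """Tables covered by `current` but not by `previous`.

    `previous` and `current` are (include_patterns, exclude_patterns) pairs.
    """
    return sorted(
        t for t in tables
        if in_policy(t, *current) and not in_policy(t, *previous)
    )


source_tables = ["t1", "t10", "t100", "t2", "t3"]
previous = (["t1*"], ["t100"])      # t1* minus t100
current = (["t1*", "t2"], [])       # exclusion lifted, t2 added
print(tables_to_bootstrap(source_tables, previous, current))  # ['t100', 't2']
```

Only the newly covered tables need a bootstrap dump; tables covered by both policies continue with incremental events.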



[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-06-04 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Description: 
*Requirements:*
{code}
- User needs to define a replication policy to replicate any specific table. This 
enables the user to replicate only the business-critical tables instead of 
replicating all tables, which may strain network bandwidth and storage and 
also slow down Hive replication.
- User needs to define a replication policy using regular expressions (such as 
db.sales_*), and needs to include additional tables which do not match the given 
pattern and exclude some tables which do match the given pattern.
- User needs to dynamically add/remove tables to/from the list by manually 
changing the replication policy at run time.
{code}

*Design:*
{code}
1. Hive continues to support the DB-level replication policy of format .* 
but logically, we support the policy as .(t1, t3, …).

2. A regular expression can also be used as the replication policy. For example,
  a. .[], 
  b. .[<*suffix>], 
  c. .[]. 

3. If a regular expression is provided as the replication policy, then Hive also 
accepts include and exclude lists as input, which also helps to dynamically 
add/remove tables for replication.
  a. The exclude list specifies the tables to be excluded even if they satisfy the 
regular expression.
  b. The include list specifies the tables to be included in addition to the tables 
satisfying the regular expression.

4. The new format for the replication policy has 3 parts, all separated by a dot 
(.).
  a. The first part is the DB name.
  b. The second part is the include list: comma-separated table names/regexes within 
square brackets [].
  c. The third part is the exclude list: comma-separated table names/regexes within 
square brackets [].
--- Full DB replication
- .*-- Full DB replication
- .[t1, t3]  -- DB replication with static list of tables t1 and 
t3 included.
- .[t1*, t2].[t100] -- DB replication of all tables having 
prefix t1, plus table t2 which doesn’t have prefix t1, but excluding 
t100 which has the prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be in any format mentioned in 
Point-4.
  b. A REPLACE clause is to be supported that takes the previous repl policy as input.
  c. The rest of the format remains the same.

7. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.

8. Single table replication of format .t1 doesn’t allow changing the 
policy dynamically. So the REPLACE clause is not allowed if previous_repl_policy 
is of this format.

9. Any table that is added dynamically, either due to a change in the regular 
expression or by being added to the include list, should be bootstrapped.
  a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy & previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. As we 
can combine the first incremental with the bootstrap dump, this removes the current 
limitation of the target DB being inconsistent after bootstrap unless we run the 
first incremental replication.
  b. If any table is renamed, then it may get dynamically added/removed for 
replication based on the defined replication policy + include/exclude list. So, 
Hive will perform bootstrap for a table which is newly included after a rename.
  c. Also, if a renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.

10. Only the initial bootstrap load expects the target DB to be empty, but the 
intermediate bootstrap on tables due to regex or inclusion/exclusion list 
changes or renames doesn’t expect the target DB or table to be empty. If any 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.
{code}
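
The rename cases in points 9b and 9c can be sketched as a small decision function. This is an illustration under stated assumptions, not Hive's implementation: rename_action, in_policy and the returned action names are hypothetical, patterns are assumed to be glob-style, and the two cases the description does not spell out (both names covered, neither name covered) are assumed to replay the rename and to skip the event, respectively.

```python
from fnmatch import fnmatchcase


def in_policy(table, include, exclude):
    """True if the table matches the include list and not the exclude list."""
    return (any(fnmatchcase(table, p) for p in include)
            and not any(fnmatchcase(table, p) for p in exclude))


def rename_action(old_name, new_name, include, exclude):
    """Decide how a rename event affects replication under one policy."""
    was_in = in_policy(old_name, include, exclude)
    now_in = in_policy(new_name, include, exclude)
    if not was_in and now_in:
        return "bootstrap_new_table"        # point 9b
    if was_in and not now_in:
        return "drop_old_table_at_target"   # point 9c
    if was_in and now_in:
        return "replay_rename"              # assumption: ordinary rename event
    return "skip"                           # assumption: never replicated


include, exclude = ["t1*", "t2"], ["t100"]
print(rename_action("t3", "t15", include, exclude))   # bootstrap_new_table
print(rename_action("t10", "t100", include, exclude)) # drop_old_table_at_target
```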



[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-06-04 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Description: 
*Requirements:*
{code}
- User needs to define a replication policy to replicate any specific table. This 
enables the user to replicate only the business-critical tables instead of 
replicating all tables, which may strain network bandwidth and storage and 
also slow down Hive replication.
- User needs to define a replication policy using regular expressions (such as 
db.sales_*), and needs to include additional tables which do not match the given 
pattern and exclude some tables which do match the given pattern.
- User needs to dynamically add/remove tables to/from the list by manually 
changing the replication policy at run time.
{code}

*Design:*
{code}
1. Hive continues to support the DB-level replication policy of format .* 
but logically, we support the policy as .(t1, t3, …).

2. A regular expression can also be used as the replication policy. For example,
  a. ., 
  b. .<*suffix>, 
  c. .. 

3. If a regular expression is provided as the replication policy, then Hive also 
accepts include and exclude lists as input, which also helps to dynamically 
add/remove tables for replication.
  a. The exclude list specifies the tables to be excluded even if they satisfy the 
regular expression.
  b. The include list specifies the tables to be included in addition to the tables 
satisfying the regular expression.

4. The new format for the replication policy has 3 parts, all separated by a dot 
(.).
  a. The first part is the DB name.
  b. The second part is the include list: comma-separated table names/regexes within 
square brackets [].
  c. The third part is the exclude list: comma-separated table names/regexes within 
square brackets [].
--- Full DB replication
- .*-- Full DB replication
- .[t1, t3]  -- DB replication with static list of tables t1 and 
t3 included.
- .[t1*, t2].[t100] -- DB replication of all tables having 
prefix t1, plus table t2 which doesn’t have prefix t1, but excluding 
t100 which has the prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be in any format mentioned in 
Point-4.
  b. A REPLACE clause is to be supported that takes the previous repl policy as input.
  c. The rest of the format remains the same.

7. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.

8. Single table replication of format .t1 doesn’t allow changing the 
policy dynamically. So the REPLACE clause is not allowed if previous_repl_policy 
is of this format.

9. Any table that is added dynamically, either due to a change in the regular 
expression or by being added to the include list, should be bootstrapped.
  a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy & previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. As we 
can combine the first incremental with the bootstrap dump, this removes the current 
limitation of the target DB being inconsistent after bootstrap unless we run the 
first incremental replication.
  b. If any table is renamed, then it may get dynamically added/removed for 
replication based on the defined replication policy + include/exclude list. So, 
Hive will perform bootstrap for a table which is newly included after a rename.
  c. Also, if a renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.

10. Only the initial bootstrap load expects the target DB to be empty, but the 
intermediate bootstrap on tables due to regex or inclusion/exclusion list 
changes or renames doesn’t expect the target DB or table to be empty. If any 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.
{code}



[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-05-21 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Description: 
*Requirements:*
{code}
- User needs to define a replication policy to replicate any specific table. This 
enables the user to replicate only the business-critical tables instead of 
replicating all tables, which may strain network bandwidth and storage and 
also slow down Hive replication.
- User needs to define a replication policy using regular expressions (such as 
db.sales_*), and needs to include additional tables which do not match the given 
pattern and exclude some tables which do match the given pattern.
- User needs to dynamically add/remove tables to/from the list by manually 
changing the replication policy at run time.
{code}

*Design:*
{code}
1. Hive continues to support the DB-level replication policy of format .* 
but logically, we support the policy as .(t1, t3, …).

2. A regular expression can also be used as the replication policy. For example,
  a. ., 
  b. .<*suffix>, 
  c. .. 

3. If a regular expression is provided as the replication policy, then Hive also 
accepts include and exclude lists as input, which also helps to dynamically 
add/remove tables for replication.
  a. The exclude list specifies the tables to be excluded even if they satisfy the 
regular expression.
  b. The include list specifies the tables to be included in addition to the tables 
satisfying the regular expression.

4. The new format for the replication policy has 3 parts, all separated by a dot 
(.).
  a. The first part is the DB name.
  b. The second part is the include list: comma-separated table names/regexes within 
square brackets [].
  c. The third part is the exclude list: comma-separated table names/regexes within 
square brackets [].
--- Full DB replication
- .*-- Full DB replication
- .[t1, t3]  -- DB replication with static list of tables t1 and 
t3 included.
- .[t1*, t2].[t100] -- DB replication of all tables having 
prefix t1, plus table t2 which doesn’t have prefix t1, but excluding 
t100 which has the prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be in any format mentioned in 
Point-4.
  b. A REPLACE clause is to be supported that takes the previous repl policy as input.
  c. The rest of the format remains the same.

7. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.

8. Any table that is added dynamically, either due to a change in the regular 
expression or by being added to the include list, should be bootstrapped.
  a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy & previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. As we 
can combine the first incremental with the bootstrap dump, this removes the current 
limitation of the target DB being inconsistent after bootstrap unless we run the 
first incremental replication.
  b. If any table is renamed, then it may get dynamically added/removed for 
replication based on the defined replication policy + include/exclude list. So, 
Hive will perform bootstrap for a table which is newly included after a rename.
  c. Also, if a renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.

9. Only the initial bootstrap load expects the target DB to be empty, but the 
intermediate bootstrap on tables due to regex or inclusion/exclusion list 
changes or renames doesn’t expect the target DB or table to be empty. If any 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.
{code}



[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-05-21 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Affects Version/s: (was: 4.0.0)

> Support table level replication in Hive
> ---
>
> Key: HIVE-21761
> URL: https://issues.apache.org/jira/browse/HIVE-21761
> Project: Hive
>  Issue Type: New Feature
>  Components: repl
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>Priority: Major
>  Labels: DR, Replication
>
> *Requirements:*
> {code}
> - User needs to define a replication policy to replicate any specific table. 
> This enables the user to replicate only the business-critical tables instead of 
> replicating all tables, which may strain network bandwidth and storage and 
> also slow down Hive replication.
> - User needs to define a replication policy using regular expressions (such as 
> db.sales_*), and needs to include additional tables which do not match the 
> given pattern and exclude some tables which do match the given pattern.
> - User needs to dynamically add/remove tables to/from the list by manually 
> changing the replication policy at run time.
> {code}
> *Design:*
> {code}
> 1. Hive continues to support the DB-level replication policy of format .* 
> but logically, we support the policy as .(t1, t3, …).
> 2. A regular expression can also be used as the replication policy. For 
> example,
>   a. ., 
>   b. .<*suffix>, 
>   c. .. 
> 3. If a regular expression is provided as the replication policy, then Hive also 
> accepts include and exclude lists as input, which also helps to dynamically 
> add/remove tables for replication.
>   a. The exclude list specifies the tables to be excluded even if they satisfy 
> the regular expression.
>   b. The include list specifies the tables to be included in addition to the 
> tables satisfying the regular expression.
> 4. The new format for the replication policy has 3 parts, all separated by a dot 
> (.).
>   a. The first part is the DB name.
>   b. The second part is the include list: comma-separated table names/regexes 
> within square brackets [].
>   c. The third part is the exclude list: comma-separated table names/regexes 
> within square brackets [].
> --- Full DB replication
> - .*-- Full DB replication
> - .[t1, t3]  -- DB replication with static list of tables t1 and 
> t3 included.
> - .[t1*, t2].[t100] -- DB replication of all tables having 
> prefix t1, plus table t2 which doesn’t have prefix t1, but excluding 
> t100 which has the prefix t1.
> 5. If the DB property “repl.source.for” is set, then by default all the 
> tables in the DB will be enabled for replication and will continue to archive 
> deleted data to CM path.
> 6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
>   a. REPL DUMP  [REPLACE  FROM 
>  WITH ;
> current_repl_policy and previous_repl_policy can be in any format mentioned in 
> Point-4.
>   b. A REPLACE clause is to be supported that takes the previous repl policy as 
> input.
>   c. The rest of the format remains the same.
> 7. Now, REPL DUMP on this DB will replicate the tables based on 
> current_repl_policy.
> 8. Any table that is added dynamically, either due to a change in the regular 
> expression or by being added to the include list, should be bootstrapped.
>   a. Hive will automatically figure out the list of newly included tables by 
> comparing the current_repl_policy & previous_repl_policy inputs and will 
> combine the bootstrap dump for the added tables with the incremental dump. As 
> we can combine the first incremental with the bootstrap dump, this removes the 
> current limitation of the target DB being inconsistent after bootstrap unless 
> we run the first incremental replication.
>   b. If any table is renamed, then it may get dynamically added/removed for 
> replication based on the defined replication policy + include/exclude list. So, 
> Hive will perform bootstrap for a table which is newly included after a 
> rename.
>   c. Also, if a renamed table is excluded from the replication policy, then the 
> old table needs to be dropped at the target as well.
> 9. Only the initial bootstrap load expects the target DB to be empty, but the 
> intermediate bootstrap on tables due to regex or inclusion/exclusion list 
> changes or renames doesn’t expect the target DB or table to be empty. If any 
> table with the same name exists during such a bootstrap, the table will be 
> overwritten, including its data.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21761) Support table level replication in Hive

2019-05-21 Thread Sankar Hariappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sankar Hariappan updated HIVE-21761:

Description: 
*Requirements:*
- User needs to define a replication policy to replicate any specific table. This 
enables the user to replicate only the business-critical tables instead of 
replicating all tables, which may strain network bandwidth and storage and 
also slow down Hive replication.
- User needs to define a replication policy using regular expressions (such as 
db.sales_*), and needs to include additional tables which do not match the given 
pattern and exclude some tables which do match the given pattern.
- User needs to dynamically add/remove tables to/from the list by manually 
changing the replication policy at run time.

*Design:*
1. Hive continues to support the DB-level replication policy of format .* 
but logically, we support the policy as .(t1, t3, …).
2. A regular expression can also be used as the replication policy. For example,
a. ., 
b. .<*suffix>, 
c. .. 
3. If a regular expression is provided as the replication policy, then Hive also 
accepts include and exclude lists as input, which also helps to dynamically 
add/remove tables for replication.
a. The exclude list specifies the tables to be excluded even if they satisfy the 
regular expression.
b. The include list specifies the tables to be included in addition to the tables 
satisfying the regular expression.
4. The new format for the replication policy has 3 parts, all separated by a dot 
(.).
a. The first part is the DB name.
b. The second part is the include list: comma-separated table names/regexes within 
square brackets [].
c. The third part is the exclude list: comma-separated table names/regexes within 
square brackets [].
--- Full DB replication
- .*-- Full DB replication
- .[t1, t3]  -- DB replication with static list of tables t1 and t3 
included.
- .[t1*, t2].[t100] -- DB replication of all tables having prefix t1, 
plus table t2 which doesn’t have prefix t1, but excluding t100 which 
has the prefix t1.
5. If the DB property “repl.source.for” is set, then by default all the tables 
in the DB will be enabled for replication and will continue to archive deleted 
data to CM path.
6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
a. REPL DUMP  [REPLACE  FROM 
 WITH ;
current_repl_policy and previous_repl_policy can be in any format mentioned in 
Point-4.
b. A REPLACE clause is to be supported that takes the previous repl policy as input.
c. The rest of the format remains the same.
d. Now, REPL DUMP on this DB will replicate the tables based on 
current_repl_policy.
7. Any table that is added dynamically, either due to a change in the regular 
expression or by being added to the include list, should be bootstrapped.
a. Hive will automatically figure out the list of newly included tables by 
comparing the current_repl_policy & previous_repl_policy inputs and will 
combine the bootstrap dump for the added tables with the incremental dump. As we 
can combine the first incremental with the bootstrap dump, this removes the current 
limitation of the target DB being inconsistent after bootstrap unless we run the 
first incremental replication.
b. If any table is renamed, then it may get dynamically added/removed for 
replication based on the defined replication policy + include/exclude list. So, 
Hive will perform bootstrap for a table which is newly included after a rename.
c. Also, if a renamed table is excluded from the replication policy, then the 
old table needs to be dropped at the target as well.
8. Only the initial bootstrap load expects the target DB to be empty, but the 
intermediate bootstrap on tables due to regex or inclusion/exclusion list 
changes or renames doesn’t expect the target DB or table to be empty. If any 
table with the same name exists during such a bootstrap, the table will be 
overwritten, including its data.


  was:
*Requirements:*
- User needs to define a replication policy to replicate any specific table. This 
enables the user to replicate only the business-critical tables instead of 
replicating all tables, which may strain network bandwidth and storage and 
also slow down Hive replication.
- User needs to define a replication policy using regular expressions (such as 
db.sales_*), and needs to include additional tables which do not match the given 
pattern and exclude some tables which do match the given pattern.
- User needs to dynamically add/remove tables to/from the list by manually 
changing the replication policy at run time.


> Support table level replication in Hive
> ---
>
> Key: HIVE-21761
> URL: https://issues.apache.org/jira/browse/HIVE-21761
> Project: Hive
>  Issue Type: New Feature
>  Components: repl
>Affects Versions: 4.0.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>Priority: Major
>  Labels: DR, Replication
>
>