[ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---------------------------
    Description: 
External tables are currently not replicated as part of hive replication. This 
jira aims to enable that.

Approach:
* The target cluster will have a top-level base directory config that will be 
used to copy all data relevant to external tables. This will be provided via 
the *with* clause in the *repl load* command. This base path will be prefixed 
to the path of the same external table on the source cluster.
* Since changes to the directories of an external table can happen without 
hive knowing about them, we cannot capture the relevant events whenever new 
data is added or removed. We will therefore have to copy the data from the 
source path to the target path for external tables every time we run 
incremental replication.
** this will require incremental *repl dump* to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
tableName,base64Encoded(tableDataLocation)
{code}
** *repl load* will read *\_external\_tables\_info* to identify which 
locations are to be copied from source to target, and create the corresponding 
tasks for them.
* New external tables will be created metadata-only, with no data copied as 
part of the regular tasks during incremental load/bootstrap load.
* Bootstrap dump will also create *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.
* Since bootstrap load will most probably create a DAG that can use 
parallelism in the execution phase, the hdfs copy related tasks are created 
only once the bootstrap phase is complete.
* Since incremental load results in a DAG with only sequential execution 
( events applied in sequence ), to effectively use the parallelism capability 
in execution mode we create the hdfs copy tasks along with the incremental 
DAG. This requires a few basic calculations to approximately meet the value 
configured in "hive.repl.approx.max.load.tasks".
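The dump/load flow around *\_external\_tables\_info* can be sketched as 
follows. This is a minimal illustration only, not the actual Hive 
implementation (which is in Java); the function names and the in-memory 
stand-in for the file are hypothetical:

{code}
import base64

# --- dump side: one line per external table, "tableName,base64(dataLocation)" ---
def dump_external_tables_info(tables):
    """tables: dict of table name -> data location on the source cluster."""
    lines = []
    for name, location in tables.items():
        encoded = base64.b64encode(location.encode("utf-8")).decode("ascii")
        lines.append(f"{name},{encoded}")
    return "\n".join(lines)

# --- load side: decode each location and prefix it with the target base dir ---
def plan_copy_tasks(info_text, target_base_dir):
    """Return (table, source_path, target_path) tuples, one per copy task."""
    tasks = []
    for line in info_text.splitlines():
        name, encoded = line.split(",", 1)
        source = base64.b64decode(encoded).decode("utf-8")
        # the base path from the "with" clause is prefixed to the source path
        target = target_base_dir.rstrip("/") + "/" + source.lstrip("/")
        tasks.append((name, source, target))
    return tasks

info = dump_external_tables_info({"sales": "/warehouse/ext/sales"})
print(plan_copy_tasks(info, "/replica/base"))
{code}

Base64-encoding the location keeps the file one-line-per-table even if a path 
contains commas or other separator characters.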
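The task-budget calculation mentioned in the last bullet could look roughly 
like the following; a hypothetical sketch, assuming the copy paths are grouped 
into chunks (one task per chunk) so the total task count stays near the 
"hive.repl.approx.max.load.tasks" value:

{code}
import math

def chunk_copy_paths(paths, max_load_tasks, tasks_already_planned):
    """Group external-table copy paths into chunks, one copy task per chunk,
    so the total number of tasks stays close to the configured budget."""
    remaining_budget = max(1, max_load_tasks - tasks_already_planned)
    chunk_size = math.ceil(len(paths) / remaining_budget)
    return [paths[i:i + chunk_size] for i in range(0, len(paths), chunk_size)]

# e.g. 10 paths with 3 tasks of budget remaining -> 3 copy tasks
groups = chunk_copy_paths([f"/ext/t{i}" for i in range(10)], 4, 1)
print(len(groups))
{code}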



> External Table Replication for Hive
> -----------------------------------
>
>                 Key: HIVE-20911
>                 URL: https://issues.apache.org/jira/browse/HIVE-20911
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>    Affects Versions: 4.0.0
>            Reporter: anishek
>            Assignee: anishek
>            Priority: Critical
>             Fix For: 4.0.0
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
