[ https://issues.apache.org/jira/browse/GOBBLIN-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291121#comment-17291121 ]

Sridivakar commented on GOBBLIN-1395:
-------------------------------------

Regarding (1), the spurious {{PostPublishSteps}} with CREATE TABLE: at 
[HiveCopyEntityHelper.java#L582|https://github.com/apache/gobblin/blob/0.15.0/gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/hive/HiveCopyEntityHelper.java#L582]
 it is found that _addSharedSteps_ is called unconditionally for every 
partition found at the source, so the table-level CREATE TABLE step is added 
once per partition.
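A minimal sketch of the kind of guard that could avoid this (class and method names are hypothetical simplifications; the real method builds HiveSpec-based {{PostPublishSteps}}): register the table-level step once per file set rather than once per partition.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of the shared-step registration: a flag
// makes the table-level CREATE TABLE step idempotent, even though the
// method is still invoked once per partition.
class SharedStepsGuard {
    private boolean tableStepAdded = false;
    private final List<String> copyEntities = new ArrayList<>();

    // Called once per partition today; the guard ensures the CREATE TABLE
    // PostPublishStep is added only on the first call.
    public void addSharedSteps(String createTableDdl) {
        if (!tableStepAdded) {
            copyEntities.add("PostPublishStep: " + createTableDdl);
            tableStepAdded = true;
        }
    }

    public List<String> getCopyEntities() {
        return copyEntities;
    }
}
{code}

With such a guard, a source table with P partitions would yield one CREATE TABLE step instead of P.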

 

Regarding (2), the {{PostPublishSteps}} with ADD PARTITIONS: at 
[HivePartitionFileSet.java#L150|https://github.com/apache/gobblin/blob/master/gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/hive/HivePartitionFileSet.java#L150]
 it is observed that, even when there are no files to copy, the code still 
generates the ADD PARTITION and CREATE TABLE {{PostPublishStep}}s instead of 
returning empty copyEntities.
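A hedged sketch of the early return that would fix this (names are hypothetical; the real code produces CopyEntity objects from HiveSpecs, not strings):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical simplification of HivePartitionFileSet's copy-entity
// generation: when the diff shows nothing to copy for a partition,
// return an empty list instead of emitting registration steps.
class PartitionCopySketch {
    public static List<String> generateCopyEntities(List<String> filesToCopy) {
        if (filesToCopy.isEmpty()) {
            // No data movement => no ADD PARTITION / CREATE TABLE steps either.
            return Collections.emptyList();
        }
        List<String> entities = new ArrayList<>(filesToCopy);
        entities.add("PostPublishStep: ALTER TABLE ... ADD PARTITION ...");
        return entities;
    }
}
{code}

This way, partitions already present at the target contribute no work units at all.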

> Spurious PostPublishStep WorkUnits for CREATE TABLE/ADD PARTITIONs for HIVE 
> table copy
> --------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-1395
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1395
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: hive-registration
>    Affects Versions: 0.15.0
>            Reporter: Sridivakar
>            Assignee: Abhishek Tiwari
>            Priority: Minor
>
> For Hive copy, Gobblin creates spurious {{PostPublishSteps}} for Hive 
> Registrations :
>  # Creates too many {{PostPublishSteps}} with CREATE TABLE. It is observed 
> that a total of P such {{PostPublishSteps}} are created for a source table 
> with P partitions.
>  # Creates {{PostPublishSteps}} with ADD PARTITIONS even for partitions 
> already present at the target, although no CopyableFile work units are 
> created for those partitions; these steps are not required.
>  
> *Steps to reproduce:*
> h5. Step 1)
> a) create a table with 5 partitions, with some rows in each partition
> {code:sql}
> hive> show partitions tc_p5_r10;
>  OK
>  dt=2020-12-26
>  dt=2020-12-27
>  dt=2020-12-28
>  dt=2020-12-29
>  dt=2020-12-30
>  Time taken: 1.287 seconds, Fetched: 5 row(s)
> {code}
> b) Do DataMovement with the Job configuration mentioned below
>  c) Observations:
> {quote}{{Total No. of old partitions in the table (O) : 0}}
>  {{Total No. of new partitions in the table (N): 5}}
>  {{*Total WorkUnits created (W): 15 ( 2 x (O+N) + N )*}}
>  {{CopyableFile WorkUnits: 5 (one for each partition)}}
>  {{PostPublishStep WorkUnits: 10 ({color:#de350b}two for each partition in 
> the table, out of the two: one for publishing table metadata; another for 
> publishing partition metadata{color})}}
> {quote}
> h5. Step 2)
> a) add 5 more partitions, with some rows in each partition 
> {code:sql}
> hive> show partitions tc_p5_r10;
>  OK
>  dt=2020-12-26
>  dt=2020-12-27
>  dt=2020-12-28
>  dt=2020-12-29
>  dt=2020-12-30
>  dt=2021-01-01
>  dt=2021-01-02
>  dt=2021-01-03
>  dt=2021-01-04
>  dt=2021-01-05
>  Time taken: 0.131 seconds, Fetched: 10 row(s)
> {code}
>  _Note: the partition for 31st Dec is intentionally left out here; it is 
> added in step (3)_
> b) Do DataMovement with the Job configuration mentioned below
>  c) Observations:
> {quote}{{Total No. of old partitions in the table (O): 5}}
>  {{Total No. of new partitions in the table (N): 5}}
>  {{Total WorkUnits created (W) : 25 ( 2 x (O+N) + N )}}
>  {{CopyableFile WorkUnits: 5 (one for each newly found partition)}}
>  {{PostPublishStep WorkUnits: 20 ({color:#de350b}two for every partition in 
> the entire table, not just for new partitions!{color})}}
> {quote}
> h5. Step 3)
> a) At source add the missing partition(2020-12-31) in middle, with some rows 
> in the partition
> {code:sql}
> hive> show partitions tc_p5_r10;
>  OK
>  dt=2020-12-26
>  dt=2020-12-27
>  dt=2020-12-28
>  dt=2020-12-29
>  dt=2020-12-30
>  dt=2020-12-31
>  dt=2021-01-01
>  dt=2021-01-02
>  dt=2021-01-03
>  dt=2021-01-04
>  dt=2021-01-05
>  Time taken: 0.101 seconds, Fetched: 11 row(s)
> {code}
> b) Do DataMovement with the Job configuration mentioned below
>  c) Observations:
> {quote}{{Total No. of old partitions in the table (O): 10}}
>  {{Total No. of new partitions in the table (N): 1}}
>  {{*Total WorkUnits created (W): 23 ( 2 x (O+N) + N )*}}
>  {{CopyableFile WorkUnits: 1 (for the newly found partition)}}
>  {{PostPublishStep WorkUnits: 22 ({color:#de350b}two for every partition in 
> the entire table, not just for the new partition!{color})}}
> {quote}
>  
>  
> h4. +Job Configuration used:+
> {code}
> job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-*
> job.description=Test Gobblin job for copy
> # target location for copy
> data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data
> gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder
> source.filebased.fs.uri="hdfs://localhost:8020"
> hive.dataset.hive.metastore.uri=thrift://localhost:9083
> hive.dataset.copy.target.table.root=${data.publisher.final.dir}
> hive.dataset.copy.target.metastore.uri=thrift://localhost:9083
> hive.dataset.copy.target.database=tc_db_copy_1
> hive.db.root.dir=${data.publisher.final.dir}
> # writer.fs.uri="hdfs://127.0.0.1:8020/"
> hive.dataset.whitelist=tc_db.tc_p5_r10
> gobblin.copy.recursive.update=true
> # ====================================================================
> # Distcp configurations (do not change)
> # ====================================================================
> type=hadoopJava
> job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher
> extract.namespace=org.apache.gobblin.copy
>  
> data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
> source.class=org.apache.gobblin.data.management.copy.CopySource
>  
> writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
> converter.classes=org.apache.gobblin.converter.IdentityConverter
> task.maxretries=0
> workunit.retry.enabled=false
> distcp.persist.dir=/tmp/distcp-persist-dir
> {code}
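The work-unit totals reported in steps 1) through 3) above all follow the same pattern, W = 2 x (O+N) + N. A small illustrative sketch that checks the three observed counts:

{code:java}
// Sanity check of the observed work-unit pattern W = 2*(O+N) + N:
// two PostPublishSteps per partition in the entire table, plus one
// CopyableFile work unit per new partition.
class WorkUnitCount {
    static int observed(int oldPartitions, int newPartitions) {
        return 2 * (oldPartitions + newPartitions) + newPartitions;
    }

    public static void main(String[] args) {
        System.out.println(observed(0, 5));  // Step 1: O=0,  N=5
        System.out.println(observed(5, 5));  // Step 2: O=5,  N=5
        System.out.println(observed(10, 1)); // Step 3: O=10, N=1
    }
}
{code}

This yields 15, 25, and 23, matching the observations above; absent the bug, only the new partitions should contribute work units.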



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
