[ https://issues.apache.org/jira/browse/GOBBLIN-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291121#comment-17291121 ]
Sridivakar commented on GOBBLIN-1395: ------------------------------------- Regarding (1), the spurious {{PostPublishSteps}} with CREATE TABLE: at [HiveCopyEntityHelper.java#L582|https://github.com/apache/gobblin/blob/0.15.0/gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/hive/HiveCopyEntityHelper.java#L582] it is found that _addSharedSteps_ adds the step unconditionally for every partition found at the source. Regarding (2), the {{PostPublishSteps}} with ADD PARTITIONS: at [HivePartitionFileSet.java#L150|https://github.com/apache/gobblin/blob/master/gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/hive/HivePartitionFileSet.java#L150] it is observed that, even when there are no files to be copied, it still goes ahead with the ADD PARTITION and CREATE TABLE related PostPublishSteps instead of returning empty copyEntities. > Spurious PostPublishStep WorkUnits for CREATE TABLE/ADD PARTITIONs for HIVE > table copy > -------------------------------------------------------------------------------------- > > Key: GOBBLIN-1395 > URL: https://issues.apache.org/jira/browse/GOBBLIN-1395 > Project: Apache Gobblin > Issue Type: Bug > Components: hive-registration > Affects Versions: 0.15.0 > Reporter: Sridivakar > Assignee: Abhishek Tiwari > Priority: Minor > > For Hive copy, Gobblin creates spurious {{PostPublishSteps}} for Hive > Registrations: > # Creates too many {{PostPublishSteps}} with CREATE TABLE: for a source table with > P partitions, it is observed that a total of P such steps are created. > # Creates {{PostPublishSteps}} with ADD PARTITIONS for partitions already present > at the target, even though it does not create CopyableFile work units > for those partitions; these steps are not required. > > *Steps to reproduce :* > h5. 
Step 1) > a) create a table with 5 partitions, with some rows in each partition > {code:sql} > hive> show partitions tc_p5_r10; > OK > dt=2020-12-26 > dt=2020-12-27 > dt=2020-12-28 > dt=2020-12-29 > dt=2020-12-30 > Time taken: 1.287 seconds, Fetched: 5 row(s) > {code} > b) Do DataMovement with the Job configuration mentioned below > c) Observations: > {quote}{{Total No. of old partitions in the table (O): 0}} > {{Total No. of new partitions in the table (N): 5}} > {{*Total WorkUnits created (W): 15 ( 2 x (O+N) + N )*}} > {{CopyableFile WorkUnits: 5 (one for each partition)}} > {{PostPublishStep WorkUnits: 10 ({color:#de350b}two for each partition in > the table, out of the two: one for publishing table metadata; another for > publishing partition metadata{color})}} > {quote} > h5. Step 2) > a) add 5 more partitions, with some rows in each partition > {code:sql} > hive> show partitions tc_p5_r10; > OK > dt=2020-12-26 > dt=2020-12-27 > dt=2020-12-28 > dt=2020-12-29 > dt=2020-12-30 > dt=2021-01-01 > dt=2021-01-02 > dt=2021-01-03 > dt=2021-01-04 > dt=2021-01-05 > Time taken: 0.131 seconds, Fetched: 10 row(s) > {code} > _Note: there is a missing partition for 31st Dec, intentionally left out for > step (3)_ > b) Do DataMovement with the Job configuration below > c) Observations: > {quote}{{Total No. of old partitions in the table (O): 5}} > {{Total No. of new partitions in the table (N): 5}} > {{Total WorkUnits created (W): 25 ( 2 x (O+N) + N )}} > {{CopyableFile WorkUnits: 5 (one for each newly found partition)}} > {{PostPublishStep WorkUnits: 20 ({color:#de350b}two for every partition in > the entire table, not just for new partitions!{color})}} > {quote} > h5. 
Step 3) > a) At source, add the missing partition (2020-12-31) in the middle, with some rows > in the partition > {code:sql} > hive> show partitions tc_p5_r10; > OK > dt=2020-12-26 > dt=2020-12-27 > dt=2020-12-28 > dt=2020-12-29 > dt=2020-12-30 > dt=2020-12-31 > dt=2021-01-01 > dt=2021-01-02 > dt=2021-01-03 > dt=2021-01-04 > dt=2021-01-05 > Time taken: 0.101 seconds, Fetched: 11 row(s) > {code} > b) Do DataMovement with the Job configuration below > c) Observations: > {quote}{{Total No. of old partitions in the table (O): 10}} > {{Total No. of new partitions in the table (N): 1}} > {{*Total WorkUnits created (W): 23 ( 2 x (O+N) + N )*}} > {{CopyableFile WorkUnits: 1 (one for the newly found partition)}} > {{PostPublishStep WorkUnits: 22 ({color:#de350b}two for every partition in > the entire table, not just for the new partition!{color})}} > {quote} > > > h4. +Job Configuration used:+ > {code:java} > job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-* > job.description=Test Gobblin job for copy > # target location for copy > data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data > gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder > source.filebased.fs.uri="hdfs://localhost:8020" > hive.dataset.hive.metastore.uri=thrift://localhost:9083 > hive.dataset.copy.target.table.root=${data.publisher.final.dir} > hive.dataset.copy.target.metastore.uri=thrift://localhost:9083 > hive.dataset.copy.target.database=tc_db_copy_1 > hive.db.root.dir=${data.publisher.final.dir} > # writer.fs.uri="hdfs://127.0.0.1:8020/" > hive.dataset.whitelist=tc_db.tc_p5_r10 > gobblin.copy.recursive.update=true > # ==================================================================== > # Distcp configurations (do not change) > # ==================================================================== > type=hadoopJava > job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher > extract.namespace=org.apache.gobblin.copy > > 
data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher > source.class=org.apache.gobblin.data.management.copy.CopySource > > writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder > converter.classes=org.apache.gobblin.converter.IdentityConverter > task.maxretries=0 > workunit.retry.enabled=false > distcp.persist.dir=/tmp/distcp-persist-dir{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
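A possible direction for the fix, sketched in plain Java. This is a simplified stand-in, not Gobblin's real {{CopyEntity}}/{{PostPublishStep}} classes; the class and method names here are hypothetical, chosen only to illustrate the intended behaviour: emit the shared CREATE TABLE step once per table (not once per partition), emit ADD PARTITION only for partitions actually copied, and return an empty collection when there is nothing to copy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified sketch of the desired planning behaviour.
// These strings stand in for Gobblin's CopyEntity / PostPublishStep types.
public class HiveCopyPlanSketch {

    /**
     * Plans work units for one table copy.
     * existingPartitions = partitions already registered at the target
     *   (they generate no work units at all, hence the parameter is unused);
     * newPartitions = partitions that actually have files to copy.
     */
    static List<String> plan(List<String> existingPartitions, List<String> newPartitions) {
        List<String> workUnits = new ArrayList<>();
        if (newPartitions.isEmpty()) {
            // Nothing to copy -> no registration PostPublishSteps either.
            return workUnits;
        }
        // Shared step added once per table, not once per partition.
        workUnits.add("PostPublishStep:CREATE_TABLE");
        for (String p : newPartitions) {
            workUnits.add("CopyableFile:" + p);
            // Register only the partitions that were actually copied.
            workUnits.add("PostPublishStep:ADD_PARTITION:" + p);
        }
        return workUnits;
    }

    public static void main(String[] args) {
        // Step 2 scenario from above: O = 5 existing, N = 5 new partitions.
        List<String> existing = List.of("dt=2020-12-26", "dt=2020-12-27",
                "dt=2020-12-28", "dt=2020-12-29", "dt=2020-12-30");
        List<String> fresh = List.of("dt=2021-01-01", "dt=2021-01-02",
                "dt=2021-01-03", "dt=2021-01-04", "dt=2021-01-05");
        // 1 CREATE TABLE + 5 copies + 5 ADD PARTITIONs = 11 work units,
        // instead of the 25 ( 2 x (O+N) + N ) observed with the current behaviour.
        System.out.println(plan(existing, fresh).size()); // 11
    }
}
```

Under this sketch, the step-3 scenario (O=10, N=1) would yield 3 work units rather than the observed 23, since the 10 already-registered partitions contribute nothing.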