[ https://issues.apache.org/jira/browse/GOBBLIN-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sridivakar updated GOBBLIN-1395:
--------------------------------

Description:
For Hive copy, Gobblin creates spurious {{PostPublishStep}}s for Hive registration:
# It creates a {{PostPublishStep}} with CREATE TABLE for each partition: for a source table with P partitions, a total of P such steps are created, where a single table-level step would suffice.
# It creates {{PostPublishStep}}s with ADD PARTITION for partitions that already exist at the target, even though it creates no {{CopyableFile}} work units for those partitions; these steps are not required.

In other words, a run that copies N new partitions should need only about 2N + 1 work units (N {{CopyableFile}}s, N partition-metadata steps, and one table-metadata step), rather than the 2 x (O+N) + N observed below.

*Steps to reproduce:*
h5. Step 1)
a) Create a table with 5 partitions, with some rows in each partition (a setup sketch follows Step 3):
{code:sql}
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
Time taken: 1.287 seconds, Fetched: 5 row(s)
{code}
b) Run the data movement with the job configuration given below.
c) Observations:
{quote}{{Total no. of old partitions in the table (O): 0}}
{{Total no. of new partitions in the table (N): 5}}
{{*Total WorkUnits created (W): 15 ( 2 x (O+N) + N )*}}
{{CopyableFile WorkUnits: 5 (one for each partition)}}
{{PostPublishStep WorkUnits: 10 ({color:#de350b}two for each partition in the table; of the two, one publishes table metadata and the other publishes partition metadata{color})}}
{quote}
h5. Step 2)
a) Add 5 more partitions, with some rows in each partition:
{code:sql}
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
dt=2021-01-01
dt=2021-01-02
dt=2021-01-03
dt=2021-01-04
dt=2021-01-05
Time taken: 0.131 seconds, Fetched: 10 row(s)
{code}
_Note: the partition for 31st Dec is intentionally left out; it is added in Step 3._
b) Run the data movement with the job configuration given below.
c) Observations:
{quote}{{Total no. of old partitions in the table (O): 5}}
{{Total no. of new partitions in the table (N): 5}}
{{*Total WorkUnits created (W): 25 ( 2 x (O+N) + N )*}}
{{CopyableFile WorkUnits: 5 (one for each newly found partition)}}
{{PostPublishStep WorkUnits: 20 ({color:#de350b}two for every partition in the entire table, not just for the new partitions!{color})}}
{quote}
h5. Step 3)
a) At the source, add the missing middle partition (2020-12-31), with some rows in it:
{code:sql}
hive> show partitions tc_p5_r10;
OK
dt=2020-12-26
dt=2020-12-27
dt=2020-12-28
dt=2020-12-29
dt=2020-12-30
dt=2020-12-31
dt=2021-01-01
dt=2021-01-02
dt=2021-01-03
dt=2021-01-04
dt=2021-01-05
Time taken: 0.101 seconds, Fetched: 11 row(s)
{code}
b) Run the data movement with the job configuration given below.
c) Observations:
{quote}{{Total no. of old partitions in the table (O): 10}}
{{Total no. of new partitions in the table (N): 1}}
{{*Total WorkUnits created (W): 23 ( 2 x (O+N) + N )*}}
{{CopyableFile WorkUnits: 1 (one for the newly found partition)}}
{{PostPublishStep WorkUnits: 22 ({color:#de350b}two for every partition in the entire table, not just for the new partition!{color})}}
{quote}
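For reference, a minimal Hive sketch of the table setup used in the steps above. Only the table name and the {{dt}} partition column come from the output shown; the data columns and row values are placeholder assumptions.
{code:sql}
-- Hypothetical schema: only the table name and the dt partition key
-- appear in this report; the data columns are placeholders.
CREATE TABLE tc_p5_r10 (id INT, val STRING)
PARTITIONED BY (dt STRING);

-- Step 1a: one partition per day, with a couple of rows in each.
INSERT INTO TABLE tc_p5_r10 PARTITION (dt='2020-12-26') VALUES (1, 'r1'), (2, 'r2');
-- ...repeat for dt='2020-12-27' through dt='2020-12-30'

-- Step 2a adds dt='2021-01-01' through dt='2021-01-05' the same way;
-- Step 3a backfills the deliberately skipped dt='2020-12-31'.
{code}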
h4. +Job Configuration used:+
{code:java}
job.name=LocalHive2LocalHive-tc_db-tc_p5_r10-*
job.description=Test Gobblin job for copy

# target location for copy
data.publisher.final.dir=/tmp/hive/tc_db_1_copy/tc_p5_r10/data

gobblin.dataset.profile.class=org.apache.gobblin.data.management.copy.hive.HiveDatasetFinder
source.filebased.fs.uri="hdfs://localhost:8020"
hive.dataset.hive.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.table.root=${data.publisher.final.dir}
hive.dataset.copy.target.metastore.uri=thrift://localhost:9083
hive.dataset.copy.target.database=tc_db_copy_1
hive.db.root.dir=${data.publisher.final.dir}
# writer.fs.uri="hdfs://127.0.0.1:8020/"
hive.dataset.whitelist=tc_db.tc_p5_r10
gobblin.copy.recursive.update=true

# ====================================================================
# Distcp configurations (do not change)
# ====================================================================
type=hadoopJava
job.class=org.apache.gobblin.azkaban.AzkabanJobLauncher
extract.namespace=org.apache.gobblin.copy
data.publisher.type=org.apache.gobblin.data.management.copy.publisher.CopyDataPublisher
source.class=org.apache.gobblin.data.management.copy.CopySource
writer.builder.class=org.apache.gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder
converter.classes=org.apache.gobblin.converter.IdentityConverter
task.maxretries=0
workunit.retry.enabled=false
distcp.persist.dir=/tmp/distcp-persist-dir
{code}
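The net effect of the redundant steps is roughly equivalent to re-issuing the following DDL once per partition in the table (illustrative only: Gobblin performs the registration through the Hive metastore API rather than by running HQL, and the column list is the placeholder schema from the sketch above):
{code:sql}
-- Issued by a PostPublishStep for EVERY partition in the table,
-- although a single CREATE TABLE per table would suffice:
CREATE TABLE IF NOT EXISTS tc_db_copy_1.tc_p5_r10 (id INT, val STRING)
PARTITIONED BY (dt STRING);

-- Also issued for partitions that already exist at the target,
-- even though no CopyableFile work unit is created for them:
ALTER TABLE tc_db_copy_1.tc_p5_r10 ADD IF NOT EXISTS PARTITION (dt='2020-12-26');
{code}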
> Spurious PostPublishStep WorkUnits for CREATE TABLE/ADD PARTITIONs for HIVE table copy
> ---------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-1395
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1395
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: hive-registration
>    Affects Versions: 0.15.0
>            Reporter: Sridivakar
>            Assignee: Abhishek Tiwari
>            Priority: Minor