Resurrecting this old thread and adding Udit.
Udit,
I am not able to reproduce this issue with HDFS. Are you seeing this pattern
of redundant alter-partition calls?
Although not related, I was looking into
https://jira.apache.org/jira/browse/HUDI-325 and am wondering if we are seeing
any discrepancies in hive-syncing between HDFS and non-HDFS clusters.
Balaji.V
On Wednesday, February 19, 2020, 11:36:08 AM PST, [email protected]
<[email protected]> wrote:
Hi Pratyaksh/Purushotham,
I spent some time this morning trying to reproduce this locally but was unable
to. There is a unit test, TestHiveSyncTool.testSyncIncremental, which is quite
close to the setup we need for a repro.
I added the check below and it passed (meaning it works as expected, with no
unnecessary update-partition calls). Can you use the code below to try
reproducing it locally and in the real ecosystem to see what is happening?
Balaji.V
```
System.out.println("DUPLICATE CHECK");
String commitTime3 = "102";
TestUtil.addCOWPartitions(1, true, dateTime, commitTime3);
hiveClient = new HoodieHiveClient(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
writtenPartitionsSince = hiveClient.getPartitionsWrittenToSince(Option.of(commitTime2));
System.out.println("Added Partitions :" + writtenPartitionsSince);
assertEquals(1, writtenPartitionsSince.size());
hivePartitions = hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName);
partitionEvents = hiveClient.getPartitionEvents(hivePartitions, writtenPartitionsSince);
assertEquals("No partition events", 0, partitionEvents.size());
tool = new HiveSyncTool(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
tool.syncHoodieTable();
// Sync should add the one partition
assertEquals(6, hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
assertEquals("The last commit that was synced should be 102", commitTime3,
    hiveClient.getLastCommitTimeSynced(TestUtil.hiveSyncConfig.tableName).get());
```
On Wednesday, February 19, 2020, 04:08:39 AM PST, Pratyaksh Sharma
<[email protected]> wrote:
Hi Balaji,
We are using Hadoop 3.1.0.
Here is the output of the function you wanted to see -
Path is : /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
Is Absolute :true
Stripped Path
=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
Stripped path does not contain scheme and authority.
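
To make the mismatch discussed in this thread concrete, here is a minimal sketch that uses only java.net.URI from the JDK. It is not Hadoop's Path or Hudi's actual comparison code; the class and method names (PartitionLocationCompare, pathOnly, samePartitionPath) are hypothetical and exist only for illustration. It shows that the two locations from the log differ as raw strings but agree once scheme and authority are dropped.

```java
import java.net.URI;

class PartitionLocationCompare {
    // Hypothetical helper (not part of Hudi or Hadoop): keep only the path
    // component of a location, dropping any scheme and authority.
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    // Treat two locations as the same partition path when their path
    // components match, regardless of scheme and authority.
    static boolean samePartitionPath(String hiveLocation, String storageLocation) {
        return pathOnly(hiveLocation).equals(pathOnly(storageLocation));
    }

    public static void main(String[] args) {
        // Values taken from the log pasted further down in this thread.
        String hive = "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117";
        String storage = "s3a://dataplatform.internal.warehouse" + hive;

        System.out.println("Plain equals      : " + hive.equals(storage));
        System.out.println("Path-only compare : " + samePartitionPath(hive, storage));
    }
}
```

Whether comparing on the path component alone is the right fix for Hudi is a separate question (a table that really moves between buckets would go undetected), so treat this as an illustration of the symptom rather than a proposed patch.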
On Mon, Feb 17, 2020 at 2:46 AM Balaji Varadarajan
<[email protected]> wrote:
>
> Sorry for the delay. From the logs, it is clear that the stored partition
> key and the lookup key are not exactly the same. One has a scheme and
> authority in its URI while the other does not. This is why we are updating
> the same partition again.
> Some of the methods used here come from hadoop-common and related
> packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally.
> I used the code below to try to repro. Which version of Hadoop are you
> using at runtime? Can you check whether the stripped path (see test code
> below) still contains the scheme and authority?
>
> ```
> public void testit() {
>   Path path = new Path(
>       "s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt"
>       + "=20191117\n");
>   System.out.println("Path is : " + path.toUri().getPath());
>   System.out.println("Is Absolute :" + path.isUriPathAbsolute());
>   String stripped = Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
>   System.out.println("Stripped Path =" + stripped);
> }
> ```
> Balaji.V
>
>
> On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham
> Pushpavanthar <[email protected]> wrote:
>
> Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
> be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
> 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180108, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180221, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180102, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191007, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191128, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191127, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191006, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191009, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191129, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191008, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191120, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191122, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191001, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191121, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191124, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191003, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191002, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191123, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191005, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191126, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191125, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191004, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181208, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181207, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181206, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181205, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180117, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181209, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181204, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181203, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181202, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181201, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191117, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
>
> Regards,
> Purushotham Pushpavanth
>
>
>
> On Tue, 4 Feb 2020 at 05:50, Vinoth Chandar <[email protected]> wrote:
>
> > Unfortunately, it looks like the mailing list does not support attachments :(
> > Could you paste it inline?
> >
> > On Sat, Feb 1, 2020 at 6:20 AM Purushotham Pushpavanthar <
> > [email protected]> wrote:
> >
> > > Hi Balaji,
> > >
> > > The attachment contains the logs you asked for.
> > > However, the only difference between storageValue and
> > > fullStoragePartitionPath is *target-base-path*.
> > > So, if I'm not wrong, the code marks every partition that received
> > > UPDATE data for a partition update, which is why it is time-consuming.
> > >
> > > Regards,
> > > Purushotham Pushpavanth
> > >
> > >
> > >
> > > On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan
> > > <[email protected]> wrote:
> > >
> > >> Hi Purushotham,
> > >> I am unable to reproduce the same partitions getting hive-synced locally.
> > >> Can you add the following log message in HoodieHiveClient.java, run
> > >> the code, and send us the logs?
> > >> diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> index 4578bb2f..ba4b1147 100644
> > >> --- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> +++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
> > >>          if (!paths.containsKey(storageValue)) {
> > >>            events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
> > >>          } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> > >> +          LOG.info("Partition Location changes. StorageVal=" + storageValue
> > >> +              + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
> > >>            events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
> > >>          }
> > >>        }
> > >>
> > >> Thanks,
> > >> Balaji.V
> > >> On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham
> > >> Pushpavanthar <[email protected]> wrote:
> > >>
> > >> Hi,
> > >>
> > >> I noticed that
> > >> *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is
> > >> time-consuming when running Hudi on a set of records that contains data
> > >> for a large number of partitions. All it does is set the location for
> > >> each updated partition path. Meanwhile,
> > >> *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()*
> > >> takes care of adding new partitions to the table.
> > >>
> > >> 1. For a given table whose base path doesn't change (it usually doesn't
> > >> in production), why is *updatePartitionsToTable()* needed? Can you
> > >> please shed some light on any case where it is needed?
> > >> 2. If it is required, can we do something to optimise the time consumed
> > >> by this operation? Currently, the *Alter Statements* are executed one by
> > >> one on each (partition, path) pair for every updated partition.
> > >>
> > >>
> > >>
> > >> Regards,
> > >> Purushotham Pushpavanth
> > >>
> > >
> > >
> >
>
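
For readers following the thread, the classification step shown in the quoted diff can be sketched in isolation. This is a simplified stand-in, not Hudi's HoodieHiveClient: the class PartitionEventSketch, its Event type, and the classify method are hypothetical names, and the second partition value (20200122) is made up for the example. It shows how a location stored in Hive without scheme and authority makes an already-synced partition look like an UPDATE when compared by plain string equality.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartitionEventSketch {
    enum EventType { ADD, UPDATE }

    // Hypothetical stand-in for a partition event; holds only what the sketch needs.
    static final class Event {
        final EventType type;
        final String partitionValue;

        Event(EventType type, String partitionValue) {
            this.type = type;
            this.partitionValue = partitionValue;
        }

        @Override
        public String toString() {
            return type + ":" + partitionValue;
        }
    }

    // hivePaths:    partition value -> location currently recorded in Hive
    // storagePaths: partition value -> full location of the written partition
    static List<Event> classify(Map<String, String> hivePaths, Map<String, String> storagePaths) {
        List<Event> events = new ArrayList<>();
        for (Map.Entry<String, String> entry : storagePaths.entrySet()) {
            String existing = hivePaths.get(entry.getKey());
            if (existing == null) {
                // Partition unknown to Hive: it needs to be added.
                events.add(new Event(EventType.ADD, entry.getKey()));
            } else if (!existing.equals(entry.getValue())) {
                // Known partition whose recorded location differs as a string.
                events.add(new Event(EventType.UPDATE, entry.getKey()));
            }
        }
        return events;
    }

    public static void main(String[] args) {
        String base = "/data/warehouse/hudi/external/prod_db/sales_order_item";
        String bucket = "s3a://dataplatform.internal.warehouse";

        Map<String, String> hivePaths = new LinkedHashMap<>();
        hivePaths.put("20191117", base + "/dt=20191117");

        Map<String, String> storagePaths = new LinkedHashMap<>();
        storagePaths.put("20191117", bucket + base + "/dt=20191117");
        storagePaths.put("20200122", bucket + base + "/dt=20200122"); // made-up new partition

        System.out.println(classify(hivePaths, storagePaths));
    }
}
```

With inputs shaped like the log in this thread, every partition touched by a commit falls into the UPDATE branch, which would explain one ALTER statement per updated partition on each sync.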