Hi Balaji, We are using Hadoop 3.1.0.
Here is the output of the function you wanted to see:

Path is : /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
Is Absolute :true
Stripped Path =/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

The stripped path does not contain scheme and authority.

On Mon, Feb 17, 2020 at 2:46 AM Balaji Varadarajan <[email protected]> wrote:

>
> Sorry for the delay. From the logs, it is clear that the stored partition key and lookup key are not exactly the same. One has scheme and authority in its URI while the other does not. This is the reason why we are updating the same partition again.
> Some of the methods used here come from hadoop-common and related packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally. I used the below code to try to repro. Which version of Hadoop are you using at runtime? Can you check if the stripped path (see test code below) still contains scheme and authority?
>
> ```
> public void testit() {
>   Path path = new Path("s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt"
>       + "=20191117\n");
>   System.out.println("Path is : " + path.toUri().getPath());
>   System.out.println("Is Absolute :" + path.isUriPathAbsolute());
>   String stripped = Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
>   System.out.println("Stripped Path =" + stripped);
> }
> ```
> Balaji.V
>
>
> On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham Pushpavanthar <[email protected]> wrote:
>
> Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180108, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180221, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180102, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191007, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191128, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191127, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191006, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191009, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191129, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191008, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191120, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191122, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191001, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191121, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191124, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191003, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191002, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191123, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191005, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191126, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191125, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191004, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181208, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181207, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181206, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181205, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20180117, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181209, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181204, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181203, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181202, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20181201, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes. StorageVal=20191117, Existing Hive Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117, New Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
>
> Regards,
> Purushotham Pushpavanth
>
>
>
> On Tue, 4 Feb 2020 at 05:50, Vinoth Chandar <[email protected]> wrote:
>
> > Unfortunately, the mailing list does not support attachments, looks like :(
> > Could you paste it inline?
> >
> > On Sat, Feb 1, 2020 at 6:20 AM Purushotham Pushpavanthar <[email protected]> wrote:
> >
> > > Hi Balaji,
> > >
> > > The attachment contains the logs you asked for.
> > > However, the only difference between storageValue and fullStoragePartitionPath is *target-base-path*.
> > > So if I'm not wrong, the code will be marking all partitions which got UPDATE data for partition update. Hence it is time consuming.
> > >
> > > Regards,
> > > Purushotham Pushpavanth
> > >
> > >
> > >
> > > On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan <[email protected]> wrote:
> > >
> > >> Hi Purushotham,
> > >> I am unable to reproduce the same partitions getting hive-synced locally.
> > >> Can you add the following log message in HoodieHiveClient.java, run the code, and send us the logs?
> > >> diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> index 4578bb2f..ba4b1147 100644
> > >> --- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> +++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
> > >>          if (!paths.containsKey(storageValue)) {
> > >>            events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
> > >>          } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> > >> +          LOG.info("Partition Location changes. StorageVal=" + storageValue
> > >> +              + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
> > >>            events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
> > >>          }
> > >>        }
> > >>
> > >> Thanks,
> > >> Balaji.V
> > >> On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham Pushpavanthar <[email protected]> wrote:
> > >>
> > >> Hi,
> > >>
> > >> I noticed that *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is time consuming while running HUDI on a set of records which contains data for a large set of partitions. All it is doing is setting the location for each updated partition path. However, *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()* is taking care of adding new partitions to the table.
> > >>
> > >> 1. For a given table whose base path doesn't change (usually it doesn't in production), why is *updatePartitionsToTable()* needed? Can you please shed some light on any case where this is needed?
> > >> 2. If it is required, can we do something to optimise the time consumed by this operation? Currently, the *Alter Statements* are executed one by one on each (partition, path) pair for every updated partition.
> > >>
> > >>
> > >>
> > >> Regards,
> > >> Purushotham Pushpavanth
> > >>
> > >
> > >
> >
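
For reference, the mismatch diagnosed in this thread (the path stored in Hive has no scheme or authority, while the newly computed location starts with s3a://...) can be illustrated with a minimal sketch. It assumes that comparing only the path component is acceptable for the table in question. It uses java.net.URI from the JDK rather than Hadoop's Path, and the class and method names (PartitionPathNormalizer, stripSchemeAndAuthority, sameLocation) are hypothetical. This is not the code Hudi uses and not necessarily how the issue was fixed upstream.

```java
import java.net.URI;

// Hypothetical helper, not part of Hudi or Hadoop.
final class PartitionPathNormalizer {

    private PartitionPathNormalizer() {
    }

    // Returns only the path component of a location, dropping any
    // scheme (e.g. "s3a") and authority (e.g. the bucket name).
    static String stripSchemeAndAuthority(String location) {
        // trim() guards against stray whitespace such as a trailing newline.
        return URI.create(location.trim()).getPath();
    }

    // Treats two locations as the same partition path when their path
    // components match, whether or not either one carries a scheme/authority.
    static boolean sameLocation(String existingHivePath, String newLocation) {
        return stripSchemeAndAuthority(existingHivePath)
            .equals(stripSchemeAndAuthority(newLocation));
    }

    public static void main(String[] args) {
        // Values taken from the log lines quoted in this thread.
        String existing = "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117";
        String updated = "s3a://dataplatform.internal.warehouse"
            + "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117";
        System.out.println(sameLocation(existing, updated)); // prints: true
    }
}
```

With a comparison like this, the 33 partitions in the log above would not be flagged as changed, so no update event would be raised for them. The trade-off is that two locations with the same path in different buckets or file systems would also compare as equal, so this only makes sense when the table's base location is known to be fixed.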
