Sorry for the delay. From the logs, it is clear that the stored partition key
and the lookup key are not exactly the same: one has a scheme and authority in
its URI while the other does not. This is why we keep updating the same
partition again.
Some of the methods used here come from hadoop-common and related packages.
With Hadoop 2.7.3, I am NOT able to reproduce this issue locally. I used the
code below to try to reproduce it. Which version of Hadoop are you using at
runtime? Can you check whether the stripped path (see the test code below)
still contains a scheme and authority?
```
import org.apache.hadoop.fs.Path;

public void testit() {
  // Same partition location as reported in the logs.
  Path path = new Path("s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/"
      + "sales_order_item/dt=20191117");
  System.out.println("Path is : " + path.toUri().getPath());
  System.out.println("Is Absolute :" + path.isUriPathAbsolute());
  // Strip scheme and authority; the result should be a plain absolute path.
  String stripped = Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
  System.out.println("Stripped Path =" + stripped);
}
```
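To illustrate the mismatch seen in the logs, here is a minimal sketch that uses
plain java.net.URI instead of Hadoop's Path, so it runs without any Hadoop
dependency. The class and method names are hypothetical and this is not the
actual HoodieHiveClient code. It only shows that a raw string comparison flags
the partition as changed, while comparing just the path component does not.

```java
import java.net.URI;

// Illustrative sketch only: names are hypothetical, not Hudi or Hadoop APIs.
final class PartitionPathCheck {

  // Returns only the path component, dropping scheme (s3a) and authority (bucket).
  static String pathOnly(String location) {
    return URI.create(location).getPath();
  }

  // Raw equality check: any textual difference counts as a location change.
  static boolean changedRaw(String hivePath, String storagePath) {
    return !hivePath.equals(storagePath);
  }

  // Compares both sides after reducing them to their path component.
  static boolean changedNormalized(String hivePath, String storagePath) {
    return !pathOnly(hivePath).equals(pathOnly(storagePath));
  }

  public static void main(String[] args) {
    // Values taken from the log entry for StorageVal=20191117.
    String hive = "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117";
    String storage = "s3a://dataplatform.internal.warehouse" + hive;

    System.out.println("raw compare says changed: " + changedRaw(hive, storage));
    System.out.println("normalized compare says changed: " + changedNormalized(hive, storage));
  }
}
```

With the values from the log, the raw comparison reports a change and the
normalized one does not, which matches the repeated "Partition Location
changes" messages.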
Balaji.V
On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham Pushpavanthar
<[email protected]> wrote:
Hi Balaji/Vinoth,
Below is the log we obtained from Hudi.
20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
be 20200122094611
20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
20200122094611, Getting commits since then
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180108, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180221, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180102, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191007, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191128, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191127, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191006, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191009, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191129, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191008, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191120, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191122, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191001, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191121, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191124, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191003, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191002, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191123, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191005, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191126, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191125, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191004, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181208, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181207, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181206, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181205, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20180117, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181209, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181204, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181203, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181202, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20181201, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
StorageVal=20191117, Existing Hive
Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117,
New
Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
Regards,
Purushotham Pushpavanth
On Tue, 4 Feb 2020 at 05:50, Vinoth Chandar <[email protected]> wrote:
> Unfortunately, the mailing list does not support attachments, looks like :(
> Could you paste it inline?
>
> On Sat, Feb 1, 2020 at 6:20 AM Purushotham Pushpavanthar <
> [email protected]> wrote:
>
> > Hi Balaji,
> >
> > The attachment contains the logs you asked for.
> > However, the only difference between storageValue and
> > fullStoragePartitionPath is *target-base-path*.
> > So if I'm not wrong, the code marks every partition that received UPDATE
> > data for a partition update, which is why it is time consuming.
> >
> > Regards,
> > Purushotham Pushpavanth
> >
> >
> >
> > On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan
> > <[email protected]> wrote:
> >
> >> Hi Purushotham,
> >> I am unable to reproduce the same partitions getting hive-synced locally.
> >> Can you add the following log message in HoodieHiveClient.java, run the
> >> code, and send us the logs?
> >> diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> >> index 4578bb2f..ba4b1147 100644
> >> --- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> >> +++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> >> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
> >>          if (!paths.containsKey(storageValue)) {
> >>            events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
> >>          } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> >> +          LOG.info("Partition Location changes. StorageVal=" + storageValue
> >> +              + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
> >>            events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
> >>          }
> >>        }
> >>
> >> Thanks,
> >> Balaji.V
> >> On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham
> >> Pushpavanthar <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I noticed that
> >> *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is time
> >> consuming when running Hudi on a set of records that contains data for a
> >> large set of partitions. All it does is set the location for each
> >> updated partition path. However,
> >> *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()*
> >> takes care of adding new partitions to the table.
> >>
> >> 1. For a given table whose base path doesn't change (usually it doesn't
> >> in production), why is *updatePartitionsToTable()* needed? Can you
> >> please shed some light on any case where this is needed?
> >> 2. If it is required, can we do something to optimise the time consumed
> >> by this operation? Currently, the *Alter Statements* are executed one by
> >> one on each (partition, path) pair for every updated partition.
> >>
> >>
> >>
> >> Regards,
> >> Purushotham Pushpavanth
> >>
> >
> >
>