Hi Balaji,

We are using Hadoop 3.1.0.

Here is the output of the function you wanted to see:

Path is : /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Is Absolute :true
Stripped Path =/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Stripped path does not contain scheme and authority.
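
For illustration, the mismatch seen in the logs further down this thread can be reproduced without Hadoop on the classpath. The sketch below uses `java.net.URI` as a stand-in for Hadoop's `Path`; the class and method names (`PartitionPathCompare`, `stripSchemeAndAuthority`, `sameLocation`) are hypothetical and this is not the Hudi implementation. It only shows why a plain string comparison reports a location change while a comparison on the path component alone does not.

```java
import java.net.URI;

public class PartitionPathCompare {

    // Returns only the path component, dropping scheme and authority if present.
    // Assumption: java.net.URI stands in for org.apache.hadoop.fs.Path here.
    static String stripSchemeAndAuthority(String location) {
        return URI.create(location).getPath();
    }

    // Treats two locations as the same when their path components match.
    static boolean sameLocation(String existingHivePath, String newLocation) {
        return stripSchemeAndAuthority(existingHivePath)
                .equals(stripSchemeAndAuthority(newLocation));
    }

    public static void main(String[] args) {
        // Values copied from the HoodieHiveClient log lines in this thread.
        String existing =
                "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117";
        String incoming =
                "s3a://dataplatform.internal.warehouse"
                + "/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117";

        System.out.println("Plain equals       : " + existing.equals(incoming));
        System.out.println("Normalized compare : " + sameLocation(existing, incoming));
    }
}
```

Comparing on the path alone would also hide a genuine move of the table to a different bucket or file system, so whether that trade-off is acceptable depends on the deployment.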

On Mon, Feb 17, 2020 at 2:46 AM Balaji Varadarajan
<[email protected]> wrote:

>
> Sorry for the delay. From the logs, it is clear that the stored partition
> key and the lookup key are not exactly the same. One has a scheme and
> authority in its URI while the other does not. This is why we are updating
> the same partitions again.
> Some of the methods used here come from hadoop-common and related
> packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally.
> I used the code below to try to reproduce it. Which version of Hadoop are
> you using at runtime? Can you check whether the stripped path (see the test
> code below) still contains the scheme and authority?
>
> ```
> // Requires: import org.apache.hadoop.fs.Path; (from hadoop-common)
> public void testit() {
>     Path path = new Path(
>         "s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt"
>         + "=20191117\n");
>     // Path component of the URI, without scheme and authority.
>     System.out.println("Path is : " + path.toUri().getPath());
>     System.out.println("Is Absolute :" + path.isUriPathAbsolute());
>     // This should print the same path, with no scheme or authority.
>     String stripped = Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
>     System.out.println("Stripped Path =" + stripped);
> }
> ```
> Balaji.V
>
>
>     On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham
> Pushpavanthar <[email protected]> wrote:
>
>  Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
> be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
> 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180108, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180221, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180102, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191007, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191128, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191127, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191127
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191006, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191006
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191009, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191009
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191129, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191129
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191008, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191008
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191120, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191120
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191122, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191122
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191001, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191001
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191121, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191121
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191124, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191124
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191003, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191003
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191002, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191002
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191123, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191123
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191005, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191005
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191126, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191126
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191125, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191125
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191004, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191004
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181208, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181208
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181207, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181207
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181206, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181206
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181205, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181205
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180117, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180117
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181209, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181209
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181204, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181204
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181203, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181203
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181202, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181202
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20181201, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20181201
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191117, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117
>
> Regards,
> Purushotham Pushpavanth
>
>
>
> On Tue, 4 Feb 2020 at 05:50, Vinoth Chandar <[email protected]> wrote:
>
> > Unfortunately, the mailing list does not support attachments, looks like :(
> > Could you paste it inline?
> >
> > On Sat, Feb 1, 2020 at 6:20 AM Purushotham Pushpavanthar <
> > [email protected]> wrote:
> >
> > > Hi Balaji,
> > >
> > > The attachment contains the logs you asked for.
> > > However, the only difference between storageValue and
> > > fullStoragePartitionPath is *target-base-path*.
> > > So, if I'm not wrong, the code marks every partition that received
> > > UPDATE data for a partition update, which is why it is time consuming.
> > >
> > > Regards,
> > > Purushotham Pushpavanth
> > >
> > >
> > >
> > > On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan
> > > <[email protected]> wrote:
> > >
> > >>  Hi Purushotham,
> > >> I am unable to reproduce the same partitions getting hive-synced locally.
> > >> Can you add the following log message in HoodieHiveClient.java, run the
> > >> code, and send us the logs?
> > >> diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> index 4578bb2f..ba4b1147 100644
> > >> --- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> +++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> > >> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
> > >>          if (!paths.containsKey(storageValue)) {
> > >>            events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
> > >>          } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> > >> +          LOG.info("Partition Location changes. StorageVal=" + storageValue
> > >> +              + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
> > >>            events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
> > >>          }
> > >>        }
> > >>
> > >>
> > >> Thanks,
> > >> Balaji.V
> > >>    On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham
> > >> Pushpavanthar <[email protected]> wrote:
> > >>
> > >>  Hi,
> > >>
> > >> I noticed that
> > >> *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is
> > >> time consuming when running HUDI on a set of records that contains data
> > >> for a large number of partitions. All it does is set the location for
> > >> each updated partition path. However,
> > >> *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()* takes
> > >> care of adding new partitions to the table.
> > >>
> > >>  1. For a given table whose base path doesn't change (it usually
> > >>  doesn't in production), why is *updatePartitionsToTable()* needed? Can
> > >>  you please shed some light on any case where this is needed?
> > >>  2. If it is required, can we do something to optimise the time consumed
> > >>  by this operation? Currently, the *ALTER statements* are executed one
> > >>  by one on each (partition, path) pair for every updated partition.
> > >>
> > >>
> > >>
> > >> Regards,
> > >> Purushotham Pushpavanth
> > >>
> > >
> > >
> >
>
