Hi Balaji,

The attachment contains the logs you asked for. However, the only difference between storageValue and fullStoragePartitionPath is *target-base-path*. So, if I'm not mistaken, the code marks every partition that received UPDATE data for a partition update, which is why the sync is so time consuming.

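To illustrate my reading of that branch, here is a minimal, self-contained sketch. It is not the actual HoodieHiveClient code: only the names storageValue and fullStoragePartitionPath are borrowed from your diff, while the class, the classify() helper, the enum and the sample paths are hypothetical, and plain strings stand in for the real path and partition types. It assumes a case where the location registered in Hive is formatted differently from the path computed from the base path.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the partition-event logic; not the real Hudi class.
final class PartitionEventSketch {

  enum EventType { ADD, UPDATE }

  /**
   * Follows the branch structure shown in the diff:
   * - partition unknown to Hive                        -> ADD
   * - partition known, registered location differs     -> UPDATE
   * - partition known, registered location is the same -> no event
   */
  static Map<String, EventType> classify(Map<String, String> hivePaths,
                                         String basePath,
                                         List<String> writtenPartitions) {
    Map<String, EventType> events = new LinkedHashMap<>();
    for (String storageValue : writtenPartitions) {
      // Assumption: the full path is simply base path + "/" + partition value.
      String fullStoragePartitionPath = basePath + "/" + storageValue;
      if (!hivePaths.containsKey(storageValue)) {
        events.put(storageValue, EventType.ADD);
      } else if (!hivePaths.get(storageValue).equals(fullStoragePartitionPath)) {
        events.put(storageValue, EventType.UPDATE);
      }
    }
    return events;
  }

  public static void main(String[] args) {
    // Hypothetical locations as Hive might report them (with scheme and authority).
    Map<String, String> hivePaths = new LinkedHashMap<>();
    hivePaths.put("2020/01/16", "hdfs://namenode:8020/data/tbl/2020/01/16");
    hivePaths.put("2020/01/17", "hdfs://namenode:8020/data/tbl/2020/01/17");

    // Base path given without scheme: the exact string comparison never matches,
    // so every already-registered partition touched by the write becomes an UPDATE.
    Map<String, EventType> events = classify(
        hivePaths, "/data/tbl", Arrays.asList("2020/01/16", "2020/01/17", "2020/01/18"));
    System.out.println(events);
  }
}
```

If the two strings are compared in the same normalised form, a partition whose location has not actually changed would produce no event, and only genuinely new partitions would remain.
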
Regards,
Purushotham Pushpavanth

On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan <[email protected]> wrote:

> Hi Purushotham,
>
> I am unable to reproduce the same partitions getting hive-synced locally.
> Can you add the following log message in HoodieHiveClient.java, run the
> code, and send us the logs?
>
> diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> index 4578bb2f..ba4b1147 100644
> --- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> +++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
>          if (!paths.containsKey(storageValue)) {
>            events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
>          } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> +          LOG.info("Partition Location changes. StorageVal=" + storageValue
> +              + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
>            events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
>          }
>        }
>
> Thanks,
> Balaji.V
>
> On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham Pushpavanthar <[email protected]> wrote:
>
> Hi,
>
> I noticed that *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()*
> is time consuming when running HUDI on a set of records that contains data
> for a large number of partitions. All it does is set the location for each
> updated partition path. However,
> *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()* already takes
> care of adding new partitions to the table.
>
> 1. For a given table whose base path doesn't change (usually it doesn't in
>    production), why is *updatePartitionsToTable()* needed? Can you please
>    shed some light on any case where it is needed?
> 2. If it is required, can we do something to optimise the time consumed by
>    this operation? Currently, the *Alter Statements* are executed one by
>    one on each (partition, path) pair for every updated partition.
>
> Regards,
> Purushotham Pushpavanth
