Hi Balaji,

The attachment contains the logs you asked for.
However, the only difference between storageValue and
fullStoragePartitionPath is *target-base-path*.
So, if I'm not wrong, the code marks every partition that received
UPDATE data for a partition update, which is why it is time-consuming.
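
To illustrate, here is a minimal sketch of that check in plain Java. It is
not the Hudi code: the class PartitionEventSketch, the EventType enum, and
the sample paths (including the hdfs://namenode prefix) are made up for
illustration. It only mirrors the if / else-if from the diff quoted below and
shows that, if the location registered in Hive and the freshly computed path
differ in some systematic way, every touched partition that already exists
ends up flagged for an update.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PartitionEventSketch {

  // Hypothetical stand-in for the add/update partition events in the quoted diff.
  enum EventType { ADD, UPDATE }

  /**
   * Mirrors the if / else-if in the quoted diff.
   *
   * @param hivePaths partition value -> location currently registered in Hive
   * @param touched   partition value -> location computed for the partitions written to
   * @return partition value -> event; partitions whose location matches get no event
   */
  static Map<String, EventType> classify(Map<String, String> hivePaths,
                                         Map<String, String> touched) {
    Map<String, EventType> events = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : touched.entrySet()) {
      String storageValue = entry.getKey();
      String fullStoragePartitionPath = entry.getValue();
      if (!hivePaths.containsKey(storageValue)) {
        // Partition is unknown to Hive: it has to be added.
        events.put(storageValue, EventType.ADD);
      } else if (!hivePaths.get(storageValue).equals(fullStoragePartitionPath)) {
        // Partition is known, but the two location strings are not equal.
        events.put(storageValue, EventType.UPDATE);
      }
    }
    return events;
  }

  public static void main(String[] args) {
    // Assumed sample data: Hive stores a fully qualified location.
    Map<String, String> hivePaths = new LinkedHashMap<>();
    hivePaths.put("2020/01/19", "hdfs://namenode/base/2020/01/19");

    // Assumed sample data: the computed path has no scheme prefix.
    Map<String, String> touched = new LinkedHashMap<>();
    touched.put("2020/01/19", "/base/2020/01/19");
    touched.put("2020/01/20", "/base/2020/01/20");

    // The existing partition is flagged UPDATE although its directory did not move.
    System.out.println(classify(hivePaths, touched));
  }
}
```

With string equality as the only test, one ALTER statement would be issued
for each such partition on every sync, even though no directory has moved.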

Regards,
Purushotham Pushpavanth



On Mon, 20 Jan 2020 at 08:58, Balaji Varadarajan <[email protected]>
wrote:

> Hi Purushotham,
> I am unable to reproduce the same partitions getting Hive-synced locally.
> Can you add the following log message to HoodieHiveClient.java, run the
> code, and send us the logs?
> diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> index 4578bb2f..ba4b1147 100644
> --- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> +++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> @@ -237,6 +237,8 @@ public class HoodieHiveClient {
>          if (!paths.containsKey(storageValue)) {
>            events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
>          } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
> +          LOG.info("Partition Location changes. StorageVal=" + storageValue
> +              + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
>            events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
>          }
>        }
>
> Thanks,
> Balaji.V
>     On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham
> Pushpavanthar <[email protected]> wrote:
>
>  Hi,
>
> I noticed that
> *org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is
> time-consuming when running Hudi on a set of records that contains data
> for a large set of partitions. All it does is set the location for each
> updated partition path. However,
> *org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()*
> takes care of adding new partitions to the table.
>
>   1. For a given table whose base path doesn't change (it usually doesn't
>   in production), why is *updatePartitionsToTable()* needed? Can you
>   please shed some light on any case where it is needed?
>   2. If it is required, can we do something to optimise the time consumed
>   by this operation? Currently, the *Alter Statements* are executed one by
>   one on each (partition, path) pair for every updated partition.
>
>
>
> Regards,
> Purushotham Pushpavanth
>
