Re: Keeping Hive in Sync

selvaraj periyasamy Sat, 04 Jul 2020 23:23:26 -0700

Adding some more info: my join condition would only look at folders within a 180-day range.
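
A rough sketch of restricting the read to such a range, rather than globbing every folder: the helper below builds a single Hadoop-style brace glob covering the last N day folders. The object name `PartitionGlob`, the function `lastDaysGlob`, and the `yyyyMMdd` day-folder naming are assumptions for illustration only; the real layout may differ, and whether Hudi 0.5.0 accepts a brace glob in `load` has not been verified here.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object PartitionGlob {
  // Assumed layout: <basePath>/<yyyyMMdd>/<hour>. Adjust the pattern to the real folder names.
  private val dayFormat = DateTimeFormatter.ofPattern("yyyyMMdd")

  /** Builds one glob matching every hour folder of the `days` days ending at `endDay` (inclusive). */
  def lastDaysGlob(basePath: String, endDay: LocalDate, days: Int): String = {
    require(days > 0, "days must be positive")
    // Oldest day first, newest day last.
    val dayNames = (0 until days).reverse.map(offset => endDay.minusDays(offset.toLong).format(dayFormat))
    val alternatives = dayNames.mkString(",")
    s"$basePath/{$alternatives}/*"
  }
}

// Hypothetical usage with the reader from the original message:
// val glob = PartitionGlob.lastDaysGlob(
//   "/projects/cdp/data/base/request_application", LocalDate.now(), 180)
// sparkSession.read.format("org.apache.hudi").load(glob)
```

With 180 days this lists 180 day folders (180*24 = 4320 hour directories) instead of all 26280.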


On Sat, Jul 4, 2020 at 11:13 PM selvaraj periyasamy <
[email protected]> wrote:

> Team,
>
> I have a question on keeping Hive in sync. Because our shared Hadoop
> environment prevents me from using Hudi 0.5.1 or higher, I ended up
> using 0.5.0. My Hadoop cluster currently runs Hive 1.2.x, which does
> not support Hudi's Hive sync.
>
> So I am not using the Hive sync feature. I am reading the data as below.
>
>
> sparkSession.
> read.
> format("org.apache.hudi").
> load("/projects/cdp/data/base/request_application/*/*").
> createOrReplaceTempView("base_request_application")
>
>
> I am going to store 3 years' worth of data partitioned by day/hour. When I
> load 3 years of data, I would have (3*365*24) = 26280 directories. With the
> above approach, I see that all the directory names are indexed on every
> read. Would this impact performance when joining with other tables if I
> don't use Hive-style partition pruning?
>
> Thanks,
> Selva
>
>
