Adding some more info: my join condition would only look at folders within a 180-day range.
On Sat, Jul 4, 2020 at 11:13 PM selvaraj periyasamy <[email protected]> wrote:

> Team,
>
> I have a question on keeping Hive in sync. A shared Hadoop environment
> restricts me from using Hudi 0.5.1 or higher, so I ended up using 0.5.0.
> Currently my Hadoop cluster has Hive 1.2.x, which does not support
> keeping Hive in sync through Hudi.
>
> So I am not using the Hive feature. I am reading the data as below.
>
>     sparkSession.
>       read.
>       format("org.apache.hudi").
>       load("/projects/cdp/data/base/request_application/*/*").
>       createOrReplaceTempView(s"base_request_application")
>
> I am going to store 3 years' worth of data partitioned by day/hour. When I
> load 3 years of data, I would have (3*365*24) = 26280 directories. Using
> the above approach and reading every time, I see that all the directory
> names are indexed. Would it impact performance when joining with other
> tables if I don't use the Hive way of partition pruning?
>
> Thanks,
> Selva
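
To make the 180-day idea concrete, here is a minimal sketch of one way to narrow the load path by hand, so that only the needed day folders are listed instead of all 26280. It only builds a Hadoop-style glob string. The object name `PartitionGlob`, the function `lastDaysGlob`, and the `yyyyMMdd` day-folder naming are assumptions made for illustration; the thread does not say how the day folders are actually named.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object PartitionGlob {
  // Assumption: day folders are named yyyyMMdd. Adjust to the real layout.
  private val dayFormat = DateTimeFormatter.ofPattern("yyyyMMdd")

  /** Builds a glob matching every hour folder of the `days` days ending at `end` (inclusive). */
  def lastDaysGlob(basePath: String, end: LocalDate, days: Int): String = {
    require(days > 0, "days must be positive")
    // Oldest day first, newest day last.
    val dayNames = (days - 1 to 0 by -1).map(offset => end.minusDays(offset).format(dayFormat))
    val joined = dayNames.mkString(",")
    // {a,b,c} is Hadoop glob alternation; the trailing /* matches the hour level.
    s"$basePath/{$joined}/*"
  }
}
```

The resulting string could be passed to load(...) in place of the "/*/*" pattern, for example PartitionGlob.lastDaysGlob("/projects/cdp/data/base/request_application", LocalDate.now(), 180). Whether Hudi 0.5.0 accepts a brace glob of this size on the read path has not been verified here, so treat it as something to test on a small range first.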
