Hi spark users, Can anyone please provide any views on the topic.
Regards Amit Joshi On Sunday, October 3, 2021, Amit Joshi <mailtojoshia...@gmail.com> wrote: > Hi Spark-Users, > > Hope you are doing good. > > I have been working on cases where a dataframe is joined with more than > one data frame separately, on different cols, that too frequently. > I was wondering how to optimize the join to make them faster. > We can consider the dataset to be big in size so broadcast joins is not an > option. > > For eg: > > schema_df1 = new StructType() > .add(StructField("key1", StringType, true)) > .add(StructField("key2", StringType, true)) > .add(StructField("val", DoubleType, true)) > > > schema_df2 = new StructType() > .add(StructField("key1", StringType, true)) > .add(StructField("val", DoubleType, true)) > > > schema_df3 = new StructType() > .add(StructField("key2", StringType, true)) > .add(StructField("val", DoubleType, true)) > > Now if we want to join > join1 = df1.join(df2,"key1") > join2 = df1.join(df3,"key2") > > I was thinking of bucketing as a solution to speed up the joins. But if I > bucket df1 on the key1,then join2 may not benefit, and vice versa (if > bucket on key2 for df1). > > or Should we bucket df1 twice, one with key1 and another with key2? > Is there a strategy to make both the joins faster for both the joins? > > > Regards > Amit Joshi > > > >