Hi Spark-Users,

Hope you are doing well.

I have been working on cases where one DataFrame is joined separately with
several other DataFrames, each on a different column, and these joins run
frequently. I was wondering how to optimize the joins to make them faster.
The datasets are large, so broadcast joins are not an option.

For example:

import org.apache.spark.sql.types._

val schema_df1 = new StructType()
  .add(StructField("key1", StringType, true))
  .add(StructField("key2", StringType, true))
  .add(StructField("val", DoubleType, true))

val schema_df2 = new StructType()
  .add(StructField("key1", StringType, true))
  .add(StructField("val", DoubleType, true))

val schema_df3 = new StructType()
  .add(StructField("key2", StringType, true))
  .add(StructField("val", DoubleType, true))

Now we want to join:
val join1 = df1.join(df2, "key1")
val join2 = df1.join(df3, "key2")

I was thinking of bucketing as a solution to speed up the joins. But if I
bucket df1 on key1, then join2 may not benefit, and vice versa (if I bucket
df1 on key2).

Or should we bucket df1 twice, once on key1 and once on key2, something
like the sketch below? Is there a strategy to make both joins faster?
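To make that concrete, here is a rough sketch of the "bucket twice" idea.
The table names and the bucket count (8) are just placeholders I made up,
and I am assuming the tables are written once and reused across many runs:

// Write df1 twice, bucketed on each join key separately.
// Bucketing in Spark requires saveAsTable (a writable catalog/metastore).
df1.write
  .bucketBy(8, "key1")
  .sortBy("key1")
  .saveAsTable("df1_bucketed_key1")

df1.write
  .bucketBy(8, "key2")
  .sortBy("key2")
  .saveAsTable("df1_bucketed_key2")

// Bucket the other sides on their join keys with the same bucket count.
df2.write.bucketBy(8, "key1").sortBy("key1").saveAsTable("df2_bucketed")
df3.write.bucketBy(8, "key2").sortBy("key2").saveAsTable("df3_bucketed")

// Reading the bucketed tables back should let Spark plan sort-merge
// joins without an extra shuffle on the join key.
val join1 = spark.table("df1_bucketed_key1")
  .join(spark.table("df2_bucketed"), "key1")
val join2 = spark.table("df1_bucketed_key2")
  .join(spark.table("df3_bucketed"), "key2")

This doubles the storage and write cost for df1, so I am not sure it is the
right trade-off when the joins are frequent but df1 changes often.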


Regards
Amit Joshi
