Thanks for the suggestions!
My concern is why moving to Spark 3 has triggered the broadcast join at all. "spark.sql.autoBroadcastJoinThreshold" is 10 MB on both engines, yet the same code uses a SortMergeJoin on Spark 2 and a BroadcastHashJoin (BHJ) on Spark 3.
Before changing the configs, there are a few options:
1. Make sure your driver has enough memory to broadcast the smaller DataFrame.
2. Raise the threshold, for example "spark.sql.autoBroadcastJoinThreshold": "2g" (setting it to "-1" disables broadcast joins entirely).
3. Use a hint on the join; see the join-hints section (you need to scroll down a bit) at
https://spark.apache.org/docs/l
Hi all, I hope everyone is doing well.
I'm currently working on a Spark migration project that aims to move all of our Spark SQL pipelines to Spark 3.x and take advantage of its performance improvements. My company is currently on Spark 2.4.0, and we are targeting 3.1.1 as the official version.