Re: Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

2022-07-06 Thread igor cabral uchoa
Thanks for the suggestions! My concern is why changing to Spark 3 has triggered the BroadcastJoin? The “spark.sql.autoBroadcastJoinThreshold” is 10mb for both engines, but when I run the same code in Spark 2 it uses SortMergeJoin, but running using Spark 3, it uses BHJ. Before change the configs

Re: Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

2022-07-06 Thread Tufan Rakshit
There are a few solutions : 1. Please make sure your driver has enough memory to broadcast the smaller dataframe . 2. Please change the config "spark.sql.autoBroadcastJoinThreshold": "2g" this an example 3. please use Hint in the Join , you need to scroll a bit down https://spark.apache.org/docs/l

Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

2022-07-06 Thread igor cabral uchoa
Hi all, I hope everyone is doing well.  I'm currently working on a Spark migration project that aims to migrate all Spark SQL pipelines for Spark 3.x version and take advantage of all performance improvements on it. My company is using Spark 2.4.0 but we are targeting to use officially the 3.1.1