You can try setting the conf option "spark.sql.join.preferSortMergeJoin" to false.
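For example, a minimal sketch (assuming an already-constructed SparkSession named `spark`, Spark 2.x RuntimeConfig API):

```scala
// Sketch: steer the planner away from sort merge join so that, when the
// other pre-conditions hold (build side small enough per partition),
// shuffle hash join can be picked instead.
// Assumes an existing SparkSession bound to the name `spark`.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
```

Note this is only a preference; the planner still falls back to sort merge join when the shuffle hash join pre-conditions are not met.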
For detailed join strategies, take a look at the source code of SparkStrategies.scala:

/**
 * Select the proper physical plan for join based on joining keys and size of logical plan.
 *
 * At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at least some of the
 * predicates can be evaluated by matching join keys. If found, Join implementations are chosen
 * with the following precedence:
 *
 * - Broadcast: if one side of the join has an estimated physical size that is smaller than the
 *     user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold
 *     or if that side has an explicit broadcast hint (e.g. the user applied the
 *     [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side
 *     of the join will be broadcasted and the other side will be streamed, with no shuffling
 *     performed. If both sides of the join are eligible to be broadcasted then the
 * - Shuffle hash join: if the average size of a single partition is small enough to build a hash
 *     table.
 * - Sort merge: if the matching join keys are sortable.
 *
 * If there is no joining keys, Join implementations are chosen with the following precedence:
 * - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
 * - CartesianProduct: for Inner join
 * - BroadcastNestedLoopJoin
 */

> On Jul 5, 2016, at 13:28, Lalitha MV <lalitham...@gmail.com> wrote:
>
> It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set to
> -1, or when the size of the small table is more than
> spark.sql.autoBroadcastJoinThreshold.
>
> On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> The join selection can be described in
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92.
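The precedence in that comment can be modeled as plain Scala. This is a hypothetical sketch for illustration only (not Spark's actual code; `JoinSelectionSketch` and its parameters are invented here), mirroring the semantics that a threshold of -1 disables broadcast:

```scala
// Hypothetical sketch of the equi-join selection precedence described
// in the SparkStrategies.scala comment above. Not Spark's real planner.
object JoinSelectionSketch {
  sealed trait Strategy
  case object BroadcastHashJoin extends Strategy
  case object ShuffleHashJoin extends Strategy
  case object SortMergeJoin extends Strategy

  // broadcastThreshold follows spark.sql.autoBroadcastJoinThreshold
  // semantics: -1 disables broadcast joins entirely.
  def select(leftBytes: Long,
             rightBytes: Long,
             broadcastThreshold: Long,
             avgPartitionSmall: Boolean,
             keysSortable: Boolean): Strategy = {
    val canBroadcast =
      broadcastThreshold >= 0 &&
        (leftBytes <= broadcastThreshold || rightBytes <= broadcastThreshold)
    if (canBroadcast) BroadcastHashJoin          // highest precedence
    else if (avgPartitionSmall) ShuffleHashJoin  // hash table fits per partition
    else if (keysSortable) SortMergeJoin         // default for sortable keys
    else sys.error("no applicable equi-join strategy in this sketch")
  }
}
```

This makes the thread's observation concrete: with the threshold at -1 and a partition too large to hash, the sketch falls through to sort merge join.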
> If you have join keys, you can set -1 at
> `spark.sql.autoBroadcastJoinThreshold` to disable broadcast joins. Then, hash
> joins are used in queries.
>
> // maropu
>
> On Tue, Jul 5, 2016 at 4:23 AM, Lalitha MV <lalitham...@gmail.com> wrote:
> Hi maropu,
>
> Thanks for your reply.
>
> Would it be possible to write a rule for this, to make it always pick shuffle
> hash join over other join implementations (i.e. sort merge and broadcast)?
>
> Is there any documentation demonstrating rule-based transformation for
> physical plan trees?
>
> Thanks,
> Lalitha
>
> On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> Hi,
>
> No, Spark has no hint for the hash join.
>
> // maropu
>
> On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV <lalitham...@gmail.com> wrote:
> Hi,
>
> In order to force broadcast hash join, we can set the
> spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce
> shuffle hash join in Spark SQL?
>
> Thanks,
> Lalitha
>
> --
> ---
> Takeshi Yamamuro
>
> --
> Regards,
> Lalitha
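Putting the thread's suggestions together, a hedged sketch (assumes an existing SparkSession `spark` and two DataFrames `small` and `large` with a join column named "id" -- all names hypothetical):

```scala
import org.apache.spark.sql.functions.broadcast

// Disable automatic broadcast joins entirely (-1), as suggested above;
// with join keys present, the planner then uses hash or sort merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Conversely, to force broadcasting one side regardless of its statistics,
// apply the explicit broadcast hint to that side:
val joined = large.join(broadcast(small), "id") // "id" is a hypothetical key
```

The two knobs work in opposite directions: the threshold governs automatic selection, while the hint overrides it for a specific join.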