Setting spark.sql.join.preferSortMergeJoin to false still only picks between sort merge join and broadcast join, depending on spark.sql.autoBroadcastJoinThreshold's value; it never picks shuffle hash join. I went through SparkStrategies, and it doesn't look like there is a direct, clean way to enforce it.
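The settings discussed above can be sketched as follows, in a spark-shell session (a sketch, not from the original thread; `dfA` and `dfB` are placeholder DataFrames):

```scala
// Sketch of the settings discussed in this thread (spark-shell session,
// Spark 2.0 era). Even with both set, the planner still chose sort merge
// or broadcast join rather than shuffle hash join.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
// Disable broadcast joins entirely by setting the threshold to -1:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// dfA and dfB are placeholder DataFrames sharing a "key" column.
val joined = dfA.join(dfB, Seq("key"))
joined.explain()  // inspect which physical join operator was selected
```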
On Mon, Jul 4, 2016 at 10:56 PM, Sun Rui <sunrise_...@163.com> wrote:

> You can try setting the "spark.sql.join.preferSortMergeJoin" conf option
> to false.
>
> For detailed join strategies, take a look at the source code of
> SparkStrategies.scala:
>
> /**
>  * Select the proper physical plan for join based on joining keys and
>  * size of logical plan.
>  *
>  * At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where
>  * at least some of the predicates can be evaluated by matching join keys.
>  * If found, Join implementations are chosen with the following precedence:
>  *
>  * - Broadcast: if one side of the join has an estimated physical size
>  *     that is smaller than the user-configurable
>  *     [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold or if that side
>  *     has an explicit broadcast hint (e.g. the user applied the
>  *     [[org.apache.spark.sql.functions.broadcast()]] function to a
>  *     DataFrame), then that side of the join will be broadcasted and the
>  *     other side will be streamed, with no shuffling performed. If both
>  *     sides of the join are eligible to be broadcasted then the
>  * - Shuffle hash join: if the average size of a single partition is
>  *     small enough to build a hash table.
>  * - Sort merge: if the matching join keys are sortable.
>  *
>  * If there is no joining keys, Join implementations are chosen with the
>  * following precedence:
>  * - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
>  * - CartesianProduct: for Inner join
>  * - BroadcastNestedLoopJoin
>  */
>
> On Jul 5, 2016, at 13:28, Lalitha MV <lalitham...@gmail.com> wrote:
>
> It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set
> to -1, or when the size of the small table is more than
> spark.sql.autoBroadcastJoinThreshold.
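The broadcast hint mentioned in the quoted comment can be applied like this (a sketch requiring a live Spark session; `large` and `small` are placeholder DataFrames, not names from the thread):

```scala
import org.apache.spark.sql.functions.broadcast

// Hint that the small side should be broadcasted, overriding
// spark.sql.autoBroadcastJoinThreshold for this particular join.
// `large` and `small` are placeholder DataFrames sharing a "key" column.
val result = large.join(broadcast(small), Seq("key"))
result.explain()  // the physical plan should show a broadcast hash join
```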
>
> On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> The join selection can be described in
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92 .
>> If you have join keys, you can set -1 at
>> `spark.sql.autoBroadcastJoinThreshold` to disable broadcast joins. Then,
>> hash joins are used in queries.
>>
>> // maropu
>>
>> On Tue, Jul 5, 2016 at 4:23 AM, Lalitha MV <lalitham...@gmail.com> wrote:
>>
>>> Hi maropu,
>>>
>>> Thanks for your reply.
>>>
>>> Would it be possible to write a rule for this, to make it always pick
>>> shuffle hash join over other join implementations (i.e. sort merge and
>>> broadcast)?
>>>
>>> Is there any documentation demonstrating rule-based transformation for
>>> physical plan trees?
>>>
>>> Thanks,
>>> Lalitha
>>>
>>> On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro
>>> <linguin....@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> No, spark has no hint for the hash join.
>>>>
>>>> // maropu
>>>>
>>>> On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV <lalitham...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> In order to force broadcast hash join, we can set
>>>>> the spark.sql.autoBroadcastJoinThreshold config. Is there a way to
>>>>> enforce shuffle hash join in Spark SQL?
>>>>>
>>>>> Thanks,
>>>>> Lalitha
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>
>>> --
>>> Regards,
>>> Lalitha
>>
>> --
>> ---
>> Takeshi Yamamuro
>
> --
> Regards,
> Lalitha

--
Regards,
Lalitha