Setting spark.sql.join.preferSortMergeJoin to false still only picks between sort merge join and broadcast join, depending on spark.sql.autoBroadcastJoinThreshold's value; it never picks shuffle hash join. I went through SparkStrategies, and it doesn't look like there is a direct, clean way to enforce it.
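The settings discussed above can be sketched as follows, in a spark-shell session (a sketch, not from the original thread; `dfA` and `dfB` are placeholder DataFrames):

```scala
// Sketch of the settings discussed in this thread (spark-shell session,
// Spark 2.0 era). Even with both set, the planner still chose sort merge
// or broadcast join rather than shuffle hash join.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
// Disable broadcast joins entirely by setting the threshold to -1:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// dfA and dfB are placeholder DataFrames sharing a "key" column.
val joined = dfA.join(dfB, Seq("key"))
joined.explain()  // inspect which physical join operator was selected
```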
On Mon, Jul 4, 2016 at 10:56 PM, Sun Rui <sunrise_...@163.com> wrote:

> You can try setting the "spark.sql.join.preferSortMergeJoin" conf option
> to false.
>
> For detailed join strategies, take a look at the source code of
> SparkStrategies.scala:
>
> /**
>  * Select the proper physical plan for join based on joining keys and
>  * size of logical plan.
>  *
>  * At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where
>  * at least some of the predicates can be evaluated by matching join keys.
>  * If found, Join implementations are chosen with the following precedence:
>  *
>  * - Broadcast: if one side of the join has an estimated physical size
>  *     that is smaller than the user-configurable
>  *     [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold or if that side
>  *     has an explicit broadcast hint (e.g. the user applied the
>  *     [[org.apache.spark.sql.functions.broadcast()]] function to a
>  *     DataFrame), then that side of the join will be broadcasted and the
>  *     other side will be streamed, with no shuffling performed. If both
>  *     sides of the join are eligible to be broadcasted then the
>  * - Shuffle hash join: if the average size of a single partition is
>  *     small enough to build a hash table.
>  * - Sort merge: if the matching join keys are sortable.
>  *
>  * If there is no joining keys, Join implementations are chosen with the
>  * following precedence:
>  * - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
>  * - CartesianProduct: for Inner join
>  * - BroadcastNestedLoopJoin
>  */
>
> On Jul 5, 2016, at 13:28, Lalitha MV <lalitham...@gmail.com> wrote:
>
> It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set
> to -1, or when the size of the small table is more than
> spark.sql.autoBroadcastJoinThreshold.
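The broadcast hint mentioned in the quoted comment can be applied like this (a sketch requiring a live Spark session; `large` and `small` are placeholder DataFrames, not names from the thread):

```scala
import org.apache.spark.sql.functions.broadcast

// Hint that the small side should be broadcasted, overriding
// spark.sql.autoBroadcastJoinThreshold for this particular join.
// `large` and `small` are placeholder DataFrames sharing a "key" column.
val result = large.join(broadcast(small), Seq("key"))
result.explain()  // the physical plan should show a broadcast hash join
```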
>
> On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> The join selection can be described in
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92 .
>> If you have join keys, you can set -1 at
>> `spark.sql.autoBroadcastJoinThreshold` to disable broadcast joins. Then,
>> hash joins are used in queries.
>>
>> // maropu
>>
>> On Tue, Jul 5, 2016 at 4:23 AM, Lalitha MV <lalitham...@gmail.com> wrote:
>>
>>> Hi maropu,
>>>
>>> Thanks for your reply.
>>>
>>> Would it be possible to write a rule for this, to make it always pick
>>> shuffle hash join over other join implementations (i.e. sort merge and
>>> broadcast)?
>>>
>>> Is there any documentation demonstrating rule-based transformation for
>>> physical plan trees?
>>>
>>> Thanks,
>>> Lalitha
>>>
>>> On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro
>>> <linguin....@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> No, spark has no hint for the hash join.
>>>>
>>>> // maropu
>>>>
>>>> On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV <lalitham...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> In order to force broadcast hash join, we can set
>>>>> the spark.sql.autoBroadcastJoinThreshold config. Is there a way to
>>>>> enforce shuffle hash join in Spark SQL?
>>>>>
>>>>> Thanks,
>>>>> Lalitha
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>
>>> --
>>> Regards,
>>> Lalitha
>>
>> --
>> ---
>> Takeshi Yamamuro
>
> --
> Regards,
> Lalitha

--
Regards,
Lalitha