Re: Enforcing shuffle hash join

2016-07-05 Thread ??????
You can try setting "spark.shuffle.manager" to "hash".
This is the meaning of the parameter:
Implementation to use for shuffling data. There are two implementations
available: sort and hash. Sort-based shuffle is more memory-efficient and is
the default option starting in 1.2.
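If helpful, a minimal configuration sketch (Spark 1.x only; the hash shuffle manager was removed in Spark 2.0, and note that this option selects the shuffle *implementation*, not which join the SQL planner picks — the app name here is invented):

```python
# Sketch for Spark 1.x: select the hash-based shuffle implementation.
# This configures how shuffle data is written, not which join strategy
# the SQL planner chooses; the "hash" manager no longer exists in 2.0+.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("hash-shuffle-example")            # hypothetical app name
        .set("spark.shuffle.manager", "hash"))         # default is "sort" since 1.2
sc = SparkContext(conf=conf)
```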




------------------ Original Message ------------------
From: "Lalitha MV" <lalitham...@gmail.com>
Date: Tuesday, Jul 5, 2016, 2:44
To: "Sun Rui" <sunrise_...@163.com>
Cc: "Takeshi Yamamuro" <linguin@gmail.com>; "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Enforcing shuffle hash join

Setting preferSortMergeJoin to false still only picks between sort merge
join and broadcast join; it does not fall back to shuffle hash join,
regardless of the autoBroadcastJoinThreshold value.
I went through SparkStrategies, and there doesn't appear to be a direct,
clean way to enforce it.

Re: Enforcing shuffle hash join

2016-07-05 Thread Lalitha MV
Setting preferSortMergeJoin to false still only picks between sort merge
join and broadcast join; it does not fall back to shuffle hash join,
regardless of the autoBroadcastJoinThreshold value.
I went through SparkStrategies, and there doesn't appear to be a direct,
clean way to enforce it.


On Mon, Jul 4, 2016 at 10:56 PM, Sun Rui <sunrise_...@163.com> wrote:

> You can try setting the "spark.sql.join.preferSortMergeJoin" conf option
> to false.
>
> For detailed join strategies, take a look at the source code of
> SparkStrategies.scala.

-- 
Regards,
Lalitha


Re: Enforcing shuffle hash join

2016-07-04 Thread Sun Rui
You can try setting the “spark.sql.join.preferSortMergeJoin” conf option to false.

For detailed join strategies, take a look at the source code of 
SparkStrategies.scala:
/**
 * Select the proper physical plan for join based on joining keys and size of 
logical plan.
 *
 * At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at 
least some of the
 * predicates can be evaluated by matching join keys. If found,  Join 
implementations are chosen
 * with the following precedence:
 *
 * - Broadcast: if one side of the join has an estimated physical size that is 
smaller than the
 * user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold
 * or if that side has an explicit broadcast hint (e.g. the user applied the
 * [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), 
then that side
 * of the join will be broadcasted and the other side will be streamed, 
with no shuffling
 * performed. If both sides of the join are eligible to be broadcasted then 
the
 * - Shuffle hash join: if the average size of a single partition is small 
enough to build a hash
 * table.
 * - Sort merge: if the matching join keys are sortable.
 *
 * If there is no joining keys, Join implementations are chosen with the 
following precedence:
 * - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
 * - CartesianProduct: for Inner join
 * - BroadcastNestedLoopJoin
 */
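The precedence described in this comment can be sketched as plain logic. This is a toy model for illustration only, not Spark's actual selection code (which also consults preferSortMergeJoin and per-side build feasibility); the function name and parameters are invented, and all sizes and the threshold are in bytes:

```python
def choose_equi_join(left_bytes, right_bytes, auto_broadcast_threshold,
                     avg_partition_bytes, keys_sortable,
                     broadcast_hint=False):
    """Toy model of the equi-join selection precedence quoted above."""
    # 1. Broadcast: one side is under the threshold, or carries an explicit hint.
    if broadcast_hint or (auto_broadcast_threshold >= 0 and
                          min(left_bytes, right_bytes) <= auto_broadcast_threshold):
        return "BroadcastHashJoin"
    # 2. Shuffle hash join: an average partition is small enough to hash in memory.
    if 0 <= avg_partition_bytes <= auto_broadcast_threshold:
        return "ShuffledHashJoin"
    # 3. Sort merge join: the join keys are sortable.
    if keys_sortable:
        return "SortMergeJoin"
    raise ValueError("no applicable equi-join strategy in this toy model")
```

Note how setting the threshold to -1 disables both the broadcast and the shuffle-hash branches in this sketch, matching the behavior reported in the thread.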


> On Jul 5, 2016, at 13:28, Lalitha MV <lalitham...@gmail.com> wrote:
> 
> It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set
> to -1, or when the size of the small table is more than
> spark.sql.autoBroadcastJoinThreshold.



Re: Enforcing shuffle hash join

2016-07-04 Thread Takeshi Yamamuro
What's the query?

On Tue, Jul 5, 2016 at 2:28 PM, Lalitha MV <lalitham...@gmail.com> wrote:

> It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is
> set to -1, or when the size of the small table is more than
> spark.sql.autoBroadcastJoinThreshold.



-- 
---
Takeshi Yamamuro


Re: Enforcing shuffle hash join

2016-07-04 Thread Lalitha MV
It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set
to -1, or when the size of the small table is more than
spark.sql.autoBroadcastJoinThreshold.

On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> The join selection is described in
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92
> If you have join keys, you can set `spark.sql.autoBroadcastJoinThreshold`
> to -1 to disable broadcast joins; shuffle hash joins are then used in
> such queries.



-- 
Regards,
Lalitha


Re: Enforcing shuffle hash join

2016-07-04 Thread Takeshi Yamamuro
The join selection is described in
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92
If you have join keys, you can set `spark.sql.autoBroadcastJoinThreshold`
to -1 to disable broadcast joins; shuffle hash joins are then used in such
queries.
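As a minimal sketch of this suggestion (assuming a PySpark 2.x session named `spark`; the DataFrames `df1`, `df2` and the join column `id` are hypothetical):

```python
# Disable broadcast joins so the planner falls back to a shuffle-based join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

joined = df1.join(df2, "id")   # df1, df2: hypothetical DataFrames sharing an "id" column
joined.explain()               # the plan should no longer contain BroadcastHashJoin
```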

// maropu

On Tue, Jul 5, 2016 at 4:23 AM, Lalitha MV <lalitham...@gmail.com> wrote:

> Hi maropu,
>
> Thanks for your reply.
>
> Would it be possible to write a rule for this, to make it always pick
> shuffle hash join over other join implementations (i.e., sort merge and
> broadcast)?
>
> Is there any documentation demonstrating rule-based transformations of
> physical plan trees?



-- 
---
Takeshi Yamamuro


Re: Enforcing shuffle hash join

2016-07-04 Thread Lalitha MV
Hi maropu,

Thanks for your reply.

Would it be possible to write a rule for this, to make it always pick
shuffle hash join over other join implementations (i.e., sort merge and
broadcast)?

Is there any documentation demonstrating rule-based transformations of
physical plan trees?

Thanks,
Lalitha

On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Hi,
>
> No, Spark has no hint for the hash join.
>
> On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV <lalitham...@gmail.com> wrote:
>
>> In order to force a broadcast hash join, we can set the
>> spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce
>> a shuffle hash join in Spark SQL?



-- 
Regards,
Lalitha


Re: Enforcing shuffle hash join

2016-07-02 Thread Takeshi Yamamuro
Hi,

No, Spark has no hint for the hash join.

// maropu

On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV  wrote:

> Hi,
>
> In order to force broadcast hash join, we can set
> the spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce
> shuffle hash join in spark sql?
>
>
> Thanks,
> Lalitha
>



-- 
---
Takeshi Yamamuro


Enforcing shuffle hash join

2016-07-01 Thread Lalitha MV
Hi,

In order to force a broadcast hash join, we can set the
spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce a
shuffle hash join in Spark SQL?
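For contrast, a minimal sketch of the broadcast case the question refers to (assuming PySpark 2.x; the DataFrames `big`, `small` and the join column `id` are hypothetical):

```python
from pyspark.sql.functions import broadcast

# Explicitly hint that `small` should be broadcast, regardless of the
# autoBroadcastJoinThreshold size estimate.
big.join(broadcast(small), "id").explain()
```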


Thanks,
Lalitha