You can try setting the "spark.sql.join.preferSortMergeJoin" option to false.
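
For example (a sketch, assuming an active PySpark `SparkSession` named `spark`; both config keys are real Spark SQL settings):

```python
# Sketch: steer the planner away from sort-merge join.
# Assumes an existing SparkSession called `spark`.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

# Optionally also disable broadcast joins, so that shuffled hash join
# is preferred whenever the equi-join keys allow it:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```

You can then check which physical join was actually chosen by calling `explain()` on the joined DataFrame.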

For the detailed join strategies, take a look at the source code of SparkStrategies.scala:
/**
 * Select the proper physical plan for a join based on the join keys and the size of the
 * logical plan.
 *
 * At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at least some of the
 * predicates can be evaluated by matching join keys. If found, join implementations are chosen
 * with the following precedence:
 *
 * - Broadcast: if one side of the join has an estimated physical size that is smaller than the
 *     user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold, or if that side
 *     has an explicit broadcast hint (e.g. the user applied the
 *     [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side
 *     of the join will be broadcasted and the other side will be streamed, with no shuffling
 *     performed. If both sides of the join are eligible to be broadcasted, one of them is
 *     chosen as the broadcast side.
 * - Shuffle hash join: if the average size of a single partition is small enough to build a
 *     hash table.
 * - Sort merge: if the matching join keys are sortable.
 *
 * If there are no join keys, join implementations are chosen with the following precedence:
 * - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
 * - CartesianProduct: for inner joins
 * - BroadcastNestedLoopJoin
 */
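
The precedence in that comment can be modeled as a small function. This is a toy sketch in Python, not Spark's actual implementation; the parameter names and the default threshold (10 MB, Spark's default for autoBroadcastJoinThreshold) are illustrative assumptions:

```python
def choose_join_strategy(smaller_side_bytes,
                         broadcast_threshold=10 * 1024 * 1024,
                         has_equi_keys=True,
                         partition_fits_hash_table=False,
                         keys_sortable=True,
                         is_inner_join=True):
    """Toy model of the join-selection precedence sketched in
    SparkStrategies.scala. Purely illustrative."""
    if has_equi_keys:
        # Broadcast wins if one side is under the (configurable) threshold.
        # A threshold of -1 disables broadcast joins entirely.
        if broadcast_threshold >= 0 and smaller_side_bytes < broadcast_threshold:
            return "BroadcastHashJoin"
        # Next, shuffled hash join if a per-partition hash table is feasible.
        if partition_fits_hash_table:
            return "ShuffledHashJoin"
        # Otherwise sort-merge join, provided the join keys are sortable.
        if keys_sortable:
            return "SortMergeJoin"
    # No usable equi-join keys: nested-loop / cartesian fallbacks.
    if broadcast_threshold >= 0 and smaller_side_bytes < broadcast_threshold:
        return "BroadcastNestedLoopJoin"
    return "CartesianProduct" if is_inner_join else "BroadcastNestedLoopJoin"
```

For example, with the threshold set to -1 (broadcast disabled) and sortable keys, the sketch falls through to sort-merge join, which matches the behavior discussed below in this thread.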


> On Jul 5, 2016, at 13:28, Lalitha MV <lalitham...@gmail.com> wrote:
> 
> It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set to 
> -1, or when the size of the small table is larger than 
> spark.sql.autoBroadcastJoinThreshold.
> 
> On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> The join selection is described in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92.
> If you have join keys, you can set `spark.sql.autoBroadcastJoinThreshold` to 
> -1 to disable broadcast joins. Then, hash joins are used in queries.
> 
> // maropu 
> 
> On Tue, Jul 5, 2016 at 4:23 AM, Lalitha MV <lalitham...@gmail.com> wrote:
> Hi maropu, 
> 
> Thanks for your reply. 
> 
> Would it be possible to write a rule for this, to make it always pick shuffle 
> hash join over the other join implementations (i.e. sort merge and broadcast)? 
> 
> Is there any documentation demonstrating rule-based transformations of 
> physical plan trees? 
> 
> Thanks,
> Lalitha
> 
> On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> Hi,
> 
> No, Spark has no hint for the hash join.
> 
> // maropu
> 
> On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV <lalitham...@gmail.com> wrote:
> Hi, 
> 
> In order to force a broadcast hash join, we can set the 
> spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce a 
> shuffle hash join in Spark SQL? 
> 
> 
> Thanks,
> Lalitha
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 
> 
> -- 
> Regards,
> Lalitha
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 
> 
> -- 
> Regards,
> Lalitha
