I haven't looked into your query yet, just want to let you know that: Spark
can only pick BroadcastNestedLoopJoin to implement left/right join. If the
table is very big, then OOM happens.

Maybe there is an algorithm to implement left/right join in a distributed
environment without broadcast, but currently Spark is only able to deal
with it using broadcast.

On Wed, Oct 23, 2019 at 6:02 PM zhangliyun <kelly...@126.com> wrote:

> Hi all:
> i want to ask a question about broadcast nestloop join? from google i
> know, that
>  left outer/semi join and right outer/semi join will use broadcast
> nestloop.
>   and in some cases, when the input data is very small, it is suitable to
> use. so here
>   how to define the input data very small? what parameter decides the
> threshold?  I just want to disable it ( i found that   set
> spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1
>  from testdata1 as a where a.key1 not in (select key3 from testdata3) )
>
>
> ```
>
> explain cost select a.key1  from testdata1 as a where a.key1 not in
> (select key3 from testdata3);
>
> == Physical Plan ==
> *(1) Project [key1#90]
> +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) ||
> isnull((key1#90 = key3#92)))
>    :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
>    +- BroadcastExchange IdentityBroadcastMode
>       +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]
>
> ```
>
>   my question is
>   1. why in not in subquery , BroadcastNestedLoopJoin is still used even i
> set spark.sql.autoBroadcastJoinThreshold= -1
>   2. which spark parameter  decides enable/disable BroadcastNestedLoopJoin.
>
>
>
> Appreciate if you have suggestion
>
>
> Best Regards
>
> Kelly Zhang
>
>
>
>

Reply via email to