Try running AnalyzeTableCommand (i.e. ANALYZE TABLE ... COMPUTE STATISTICS) on both tables first, so the optimizer has up-to-date size estimates.
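
For example, from PySpark (a minimal sketch; it assumes a SparkSession
named spark and that T1 and T2 are permanent tables in the catalog, not
just temp views):

    # Collect table-level statistics so the optimizer can estimate sizes
    spark.sql("ANALYZE TABLE T1 COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE T2 COMPUTE STATISTICS")

    # Inspect the collected statistics for the small table
    spark.sql("DESCRIBE EXTENDED T2").show(truncate=False)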

On Wed, Apr 18, 2018 at 2:57 AM Matteo Cossu <elco...@gmail.com> wrote:

> Can you check the value for spark.sql.autoBroadcastJoinThreshold?
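>
> For example (a quick sketch, assuming a SparkSession named spark):
>
>     # Default is 10485760 bytes (10 MB); -1 disables broadcast joins
>     print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
>
>     # Raise it if the small side's estimated size is above the default
>     spark.conf.set("spark.sql.autoBroadcastJoinThreshold",
>                    100 * 1024 * 1024)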
>
> On 29 March 2018 at 14:41, Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
> wrote:
>
>> I am looking at the physical plan for the following query:
>>
>> SELECT f1,f2,f3,...
>> FROM T1
>> LEFT ANTI JOIN T2 ON T1.id = T2.id
>> WHERE  f1 = 'bla'
>>        AND f2 = 'bla2'
>>        AND some_date >= date_sub(current_date(), 1)
>> LIMIT 100
>>
>> An important detail: table T1 can be very large (hundreds of thousands
>> of rows), while table T2 is rather small, a few thousand rows at most.
>> In this particular case, T2 has only 2 rows.
>>
>> In the physical plan, I see that a SortMergeJoin is performed, despite
>> this being a perfect candidate for a broadcast join.
>>
>> What could be the reason for this?
>> Is there a way to hint the optimizer to perform a broadcast join in the
>> SQL syntax?
>>
>> I am writing this in PySpark, and the query itself runs over Parquet
>> files stored in Azure Blob Storage.
>>
>>
>>
>
