Try running AnalyzeTableCommand on both tables first.

On Wed, Apr 18, 2018 at 2:57 AM Matteo Cossu <elco...@gmail.com> wrote:
> Can you check the value for spark.sql.autoBroadcastJoinThreshold?
>
> On 29 March 2018 at 14:41, Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
> wrote:
>
>> I am looking at the physical plan for the following query:
>>
>> SELECT f1, f2, f3, ...
>> FROM T1
>> LEFT ANTI JOIN T2 ON T1.id = T2.id
>> WHERE f1 = 'bla'
>>   AND f2 = 'bla2'
>>   AND some_date >= date_sub(current_date(), 1)
>> LIMIT 100
>>
>> An important detail: table T1 can be very large (hundreds of thousands
>> of rows), but table T2 is rather small, at most in the thousands. In this
>> particular case, T2 has 2 rows.
>>
>> In the physical plan I see that a SortMergeJoin is performed, even though
>> this looks like a perfect candidate for a broadcast join.
>>
>> What could be the reason for this?
>> Is there a way, in the SQL syntax, to hint the optimizer to perform a
>> broadcast join?
>>
>> I am writing this in pyspark, and the query itself runs over parquet
>> files stored in Azure blob storage.
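
Pulling the thread's suggestions together, a sketch in Spark SQL (table names taken from the query above; the threshold value shown is the default and is only illustrative):

```sql
-- Check the broadcast threshold; a value of -1 disables automatic
-- broadcast joins entirely, which would explain the SortMergeJoin.
SET spark.sql.autoBroadcastJoinThreshold;
-- Default is 10485760 bytes (10 MB); raise it if T2's estimated size
-- is above it.
SET spark.sql.autoBroadcastJoinThreshold=10485760;

-- Compute table-level statistics (row count, size in bytes) so the
-- optimizer can compare T2's estimated size against the threshold;
-- this is the user-facing form of AnalyzeTableCommand.
ANALYZE TABLE T1 COMPUTE STATISTICS;
ANALYZE TABLE T2 COMPUTE STATISTICS;

-- Or hint the broadcast explicitly (supported since Spark 2.2;
-- BROADCASTJOIN and MAPJOIN are accepted aliases):
SELECT /*+ BROADCAST(T2) */ f1, f2, f3
FROM T1
LEFT ANTI JOIN T2 ON T1.id = T2.id
WHERE f1 = 'bla';
```

In the pyspark DataFrame API, the equivalent hint is `pyspark.sql.functions.broadcast`, e.g. `df1.join(broadcast(df2), "id", "left_anti")`.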