Re: Subquery performance

2016-03-20 Thread Michael Armbrust
t? > > > > y > > > > *From:* Michael Armbrust [mailto:mich...@databricks.com] > *Sent:* March-17-16 8:59 PM > *To:* Younes Naguib > *Cc:* user@spark.apache.org > *Subject:* Re: Subquery performance > > > > Try running EXPLAIN on both version of the query.

Re: Subquery performance

2016-03-19 Thread Michael Armbrust
Try running EXPLAIN on both version of the query. Likely when you cache the subquery we know that its going to be small so use a broadcast join instead of a shuffling the data. On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi all, > > > > I’m running

RE: Subquery performance

2016-03-19 Thread Younes Naguib
Anyways to cache the subquery or force a broadcast join without persisting it? y From: Michael Armbrust [mailto:mich...@databricks.com] Sent: March-17-16 8:59 PM To: Younes Naguib Cc: user@spark.apache.org Subject: Re: Subquery performance Try running EXPLAIN on both version of the query

Subquery performance

2016-03-19 Thread Younes Naguib
Hi all, I'm running a query that looks like the following: Select col1, count(1) >From (Select col2, count(1) from tab2 group by col2) Inner join tab1 on (col1=col2) Group by col1 This creates a very large shuffle, 10 times the data size, as if the subquery was executed for each row. Anything