Hi Jianshi,
I couldn't reproduce that with the latest master. In my testing I always get the 
BroadcastHashJoin for managed tables (in .csv files). Are there any external 
tables in your case?

In general, here are a couple of things you can try first (with HiveContext); a short Scala sketch follows the list:

1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (run this for every table involved in the join);

2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (raise the threshold; the default is 1024*1024*10 bytes, i.e. 10 MB. Make sure the largest dimension table's size in bytes stays below this value);

3) Always put the main table (the biggest table) in the left-most position among the inner joins.
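
For reference, a minimal sketch of steps 1) to 3) from the Scala shell. The table names (fact_table, dim_1, dim_2), the 50 MB threshold, and the existing SparkContext sc are assumptions for illustration, not taken from your setup:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// 1) Collect table-size statistics so the planner knows which tables are small.
Seq("fact_table", "dim_1", "dim_2").foreach { t =>
  hiveContext.sql(s"ANALYZE TABLE $t COMPUTE STATISTICS NOSCAN")
}

// 2) Raise the broadcast threshold (in bytes); the default is 1024 * 1024 * 10.
//    Every dimension table's totalSize must stay below this value.
hiveContext.sql("SET spark.sql.autoBroadcastJoinThreshold=" + (1024 * 1024 * 50))

// 3) Keep the biggest (fact) table left-most in the chain of inner joins.
val result = hiveContext.sql("""
  SELECT f.id, d1.name, d2.name
  FROM fact_table f
  JOIN dim_1 d1 ON f.d1_key = d1.key
  JOIN dim_2 d2 ON f.d2_key = d2.key
""")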

DESC EXTENDED tablename; -- prints the detailed table information, including the statistics field “totalSize” (the table size in bytes)
EXPLAIN EXTENDED query; -- prints the detailed physical plan.
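
A rough sketch of checking both from the Scala shell (same hypothetical table names as above; inspecting queryExecution is just one way to look at the chosen plan):

// totalSize shows up in the table parameters once statistics have been collected.
hiveContext.sql("DESC EXTENDED dim_1").collect().foreach(println)

// The physical plan should contain BroadcastHashJoin when the optimization applies.
val q = hiveContext.sql(
  "SELECT f.id, d1.name FROM fact_table f JOIN dim_1 d1 ON f.d1_key = d1.key")
println(q.queryExecution.executedPlan)
// Or simply: hiveContext.sql("EXPLAIN EXTENDED <the query>").collect().foreach(println)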

Let me know if you still have problems.

Hao

From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Thursday, November 27, 2014 10:24 PM
To: Cheng, Hao
Cc: user
Subject: Re: Auto BroadcastJoin optimization failed in latest Spark

Hi Hao,

I'm using inner joins, as broadcast join didn't work for left joins (thanks for 
the links to the latest improvements).

And I'm using HiveContext; it worked in a previous build (10/12) when joining 
15 dimension tables.

Jianshi

On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao 
<hao.ch...@intel.com> wrote:
Are all of your join keys the same? And I guess the join types are all “Left” 
joins; https://github.com/apache/spark/pull/3362 is probably what you need.

Also, Spark SQL doesn't currently support multiway joins (or multiway broadcast 
joins); https://github.com/apache/spark/pull/3270 should be another 
optimization for this.


From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Wednesday, November 26, 2014 4:36 PM
To: user
Subject: Auto BroadcastJoin optimization failed in latest Spark

Hi,

I've confirmed that the latest Spark, with either Hive 0.12 or 0.13.1, fails to 
apply the auto broadcast join optimization in my query. The query joins a huge 
fact table with 15 tiny dimension tables.
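
(For context, the query has roughly this shape; the table and column names below are made up, not the real schema:)

hiveContext.sql("""
  SELECT f.*
  FROM fact_table f
  JOIN dim_1 d1 ON f.d1_key = d1.key
  JOIN dim_2 d2 ON f.d2_key = d2.key
  -- ... and 13 more tiny dimension tables joined the same way ...
""")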

I'm currently using an older version of Spark which was built on Oct. 12.

Has anyone else run into a similar situation?

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/



--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
