Sorry for the late follow-up. I used Hao's DESC EXTENDED command and found some clues:
New (broadcast-broken Spark build):

    parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}

Old (broadcast-working Spark build):

    parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}

It looks like the table size computation failed in the latest version. I've already run the analyze command:

    ANALYZE TABLE $table COMPUTE STATISTICS NOSCAN;

And the tables are created from Parquet files, e.g.:

    CREATE EXTERNAL TABLE table1 (
      code int,
      desc string
    )
    STORED AS PARQUET
    LOCATION '/user/jianshuang/data/dim_tables/table1.parquet';

Does anyone know what went wrong?

Thanks,
Jianshi

On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Hi Jianshi,
>
> I couldn't reproduce that with the latest MASTER, and I always get a
> BroadcastHashJoin for managed tables (in .csv files) in my testing. Are
> there any external tables in your case?
>
> In general, there are a couple of things you can try first (with HiveContext):
>
> 1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
>    of the tables);
>
> 2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold
>    to a greater value; it is 1024*1024*10 by default. Just make sure the
>    largest dimension table's size in bytes is less than this.)
>
> 3) Always put the main table (the biggest table) left-most among the
>    inner joins.
>
> DESC EXTENDED tablename;  -- this prints the detailed table information,
> including the statistics (the field "totalSize").
>
> EXPLAIN EXTENDED query;  -- this prints the detailed physical plan.
>
> Let me know if you still have problems.
>
> Hao
>
> From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
> Sent: Thursday, November 27, 2014 10:24 PM
> To: Cheng, Hao
> Cc: user
> Subject: Re: Auto BroadcastJoin optimization failed in latest Spark
>
> Hi Hao,
>
> I'm using inner joins, as broadcast join didn't work for left joins (thanks
> for the links to the latest improvements).
>
> And I'm using HiveContext, and it worked in a previous build (10/12) when
> joining 15 dimension tables.
>
> Jianshi
>
> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> > Are all of your join keys the same? And I guess the join types are all
> > "left" joins; https://github.com/apache/spark/pull/3362 is probably what
> > you need.
> >
> > And SparkSQL doesn't currently support multiway joins (or multiway
> > broadcast joins); https://github.com/apache/spark/pull/3270 should be
> > another optimization for this.
> >
> > From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
> > Sent: Wednesday, November 26, 2014 4:36 PM
> > To: user
> > Subject: Auto BroadcastJoin optimization failed in latest Spark
> >
> > Hi,
> >
> > I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1
> > fails to optimize auto broadcast joins in my query. I have a query that
> > joins a huge fact table with 15 tiny dimension tables.
> >
> > I'm currently using an older version of Spark, which was built on Oct. 12.
> >
> > Has anyone else met a similar situation?
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
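P.S. For intuition about why the broken stats disable the optimization, here is a minimal sketch of the size check the planner performs. This is an illustration only, not Spark's actual code: the function names and the fallback value used when stats are missing are my assumptions; the 10 MB threshold and the totalSize values come from the thread above.

```python
# Illustrative sketch (NOT Spark internals): the planner compares a table's
# estimated size against spark.sql.autoBroadcastJoinThreshold. When the Hive
# stats are missing or broken (totalSize=0, numRows=-1), the estimate falls
# back to a "practically infinite" default, so the table is never broadcast.

AUTO_BROADCAST_THRESHOLD = 1024 * 1024 * 10  # 10 MB, the default cited above


def estimated_size(total_size: int) -> int:
    """Use the Hive stat when usable, else a huge assumed fallback."""
    UNKNOWN_SIZE = 2**63 - 1  # assumption: stand-in for "stats unavailable"
    return total_size if total_size > 0 else UNKNOWN_SIZE


def can_auto_broadcast(total_size: int) -> bool:
    """True when the estimated size fits under the broadcast threshold."""
    return estimated_size(total_size) <= AUTO_BROADCAST_THRESHOLD


# Old build reported totalSize=56166 -> small enough, broadcast applies.
print(can_auto_broadcast(56166))  # True
# New build reported totalSize=0 -> treated as unknown, no broadcast.
print(can_auto_broadcast(0))      # False
```

This is why both of Hao's suggestions attack the same problem from different sides: ANALYZE TABLE tries to repopulate totalSize, while raising spark.sql.autoBroadcastJoinThreshold only helps if the stats are present but larger than the default threshold.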