If I run ANALYZE without NOSCAN, then Hive can successfully get the size:

parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589,
COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}
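As a quick sanity check, the parameters:{...} blob that DESC EXTENDED prints can be parsed mechanically. This is a minimal sketch in Python; the helper functions below are mine for illustration, not a Hive or Spark API, and the validity rule (positive totalSize, non-negative numRows) is an assumption based on the values seen in this thread:

```python
def parse_table_parameters(blob):
    """Parse a Hive 'parameters:{k=v, ...}' blob into a dict (illustrative helper)."""
    inner = blob.split("{", 1)[1].rsplit("}", 1)[0]
    return dict(kv.split("=", 1) for kv in inner.split(", "))

def stats_look_valid(params):
    """Assumed rule: totalSize must be positive and numRows, if present, non-negative."""
    total_size = int(params.get("totalSize", "0"))
    num_rows = int(params.get("numRows", "0"))
    return total_size > 0 and num_rows >= 0

# The two outputs quoted later in this thread:
broken = parse_table_parameters(
    "parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, "
    "COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}")
working = parse_table_parameters(
    "parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}")

print(stats_look_valid(broken))   # False: totalSize=0 and numRows=-1
print(stats_look_valid(working))  # True: totalSize=56166
```

Running this against the two DESC EXTENDED outputs in the thread shows at a glance which build left the table without usable statistics.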
Is Hive's PARQUET support broken?

Jianshi

On Fri, Dec 5, 2014 at 3:30 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:

> Sorry for the late follow-up.
>
> I used Hao's DESC EXTENDED command and found some clues:
>
> new (broadcast-broken Spark build):
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
> COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
>
> old (broadcast-working Spark build):
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
> totalSize=56166}
>
> It looks like the table-size computation fails in the latest version.
>
> I've run the analyze command:
>
>   ANALYZE TABLE $table COMPUTE STATISTICS NOSCAN
>
> And the tables are created from Parquet files, e.g.:
>
>   CREATE EXTERNAL TABLE table1 (
>     code int,
>     desc string
>   )
>   STORED AS PARQUET
>   LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
>
> Does anyone know what went wrong?
>
> Thanks,
> Jianshi
>
> On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
>> Hi Jianshi,
>>
>> I couldn't reproduce that with the latest MASTER, and I always get a
>> BroadcastHashJoin for managed tables (in .csv files) in my testing. Are
>> there any external tables in your case?
>>
>> In general, a couple of things you can try first (with HiveContext):
>>
>> 1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
>> of the tables);
>>
>> 2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold
>> to a greater value; it is 1024*1024*10 bytes by default, so just make sure
>> the largest dimension table's size in bytes is less than this);
>>
>> 3) Always put the main table (the biggest table) left-most among the
>> inner joins.
>>
>> DESC EXTENDED tablename; -- prints the detailed table statistics
>> (the field "totalSize")
>>
>> EXPLAIN EXTENDED query; -- prints the detailed physical plan
>>
>> Let me know if you still have problems.
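The threshold rule Hao describes in point 2 can be sketched as follows. This is a minimal Python illustration, not Spark's actual code; in particular, the assumption that a table with missing or invalid stats (totalSize <= 0) is treated as effectively huge, and therefore never broadcast, is my reading of the behavior reported in this thread:

```python
# Default of spark.sql.autoBroadcastJoinThreshold, per Hao's message.
AUTO_BROADCAST_JOIN_THRESHOLD = 1024 * 1024 * 10

# Assumption: a table whose stats are missing/invalid is presumed huge,
# so it is never chosen for broadcast.
ASSUMED_SIZE_WHEN_STATS_MISSING = 2**63 - 1

def effective_table_size(total_size):
    """totalSize <= 0 (as in the broken build's output) counts as unknown."""
    if total_size is None or total_size <= 0:
        return ASSUMED_SIZE_WHEN_STATS_MISSING
    return total_size

def can_broadcast(total_size, threshold=AUTO_BROADCAST_JOIN_THRESHOLD):
    """Broadcast a table only if its known size fits under the threshold."""
    return effective_table_size(total_size) <= threshold

# With the old build's totalSize=56166, the dimension table fits under 10 MB
# and a BroadcastHashJoin is eligible; with the broken build's totalSize=0,
# the size is unknown, so the planner falls back to a shuffle join.
print(can_broadcast(56166))  # True
print(can_broadcast(0))      # False
```

This is consistent with the symptom above: the old build recorded totalSize=56166 and broadcast worked, while the new build recorded totalSize=0 / numRows=-1 and broadcast did not.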
>>
>> Hao
>>
>> *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>> *Sent:* Thursday, November 27, 2014 10:24 PM
>> *To:* Cheng, Hao
>> *Cc:* user
>> *Subject:* Re: Auto BroadcastJoin optimization failed in latest Spark
>>
>> Hi Hao,
>>
>> I'm using inner joins, as broadcast join didn't work for left joins
>> (thanks for the links to the latest improvements).
>>
>> And I'm using HiveContext, and it worked in a previous build (10/12)
>> when joining 15 dimension tables.
>>
>> Jianshi
>>
>> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>
>> Are all of your join keys the same? I guess the join types are all
>> "left" joins; https://github.com/apache/spark/pull/3362 is probably what
>> you need.
>>
>> Also, Spark SQL doesn't currently support multi-way join (or multi-way
>> broadcast join); https://github.com/apache/spark/pull/3270 should be
>> another optimization for this.
>>
>> *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>> *Sent:* Wednesday, November 26, 2014 4:36 PM
>> *To:* user
>> *Subject:* Auto BroadcastJoin optimization failed in latest Spark
>>
>> Hi,
>>
>> I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1
>> fails to optimize auto broadcast join in my query. I have a query that
>> joins a huge fact table with 15 tiny dimension tables.
>>
>> I'm currently using an older version of Spark, built on Oct. 12.
>>
>> Has anyone else met a similar situation?
--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/