With Liancheng's suggestion, I've tried setting
spark.sql.hive.convertMetastoreParquet to false, but ANALYZE with NOSCAN
still returns -1 for rawDataSize.

Jianshi

On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:

> If I run ANALYZE without NOSCAN, then Hive can successfully get the size:
>
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589,
> COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}
>
> Is Hive's PARQUET support broken?
>
> Jianshi
>
> On Fri, Dec 5, 2014 at 3:30 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
>> Sorry for the late follow-up.
>>
>> I used Hao's DESC EXTENDED command and found some clues.
>>
>> New (broadcast-broken Spark build):
>> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
>> COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
>>
>> Old (broadcast-working Spark build):
>> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
>> totalSize=56166}
>>
>> It looks like the table size computation fails in the latest version.
>>
>> I've run the analyze command:
>>
>> ANALYZE TABLE $table COMPUTE STATISTICS NOSCAN
>>
>> And the tables are created from Parquet files, e.g.:
>>
>> CREATE EXTERNAL TABLE table1 (
>>   code int,
>>   desc string
>> )
>> STORED AS PARQUET
>> LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
>>
>> Does anyone know what went wrong?
>>
>> Thanks,
>> Jianshi
>>
>> On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>
>>> Hi Jianshi,
>>>
>>> I couldn't reproduce that with the latest MASTER, and I always get the
>>> BroadcastHashJoin for managed tables (in .csv files) in my testing. Are
>>> there any external tables in your case?
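The two "parameters" blobs quoted above make the failure visible: the metastore statistics use -1 (and a totalSize of 0) as "unknown" sentinels. As a quick sanity check on many tables, here is a small sketch that parses that blob and flags broken statistics. The helpers and their names are hypothetical; only the field names come from the DESC EXTENDED output quoted above.

```python
# Hypothetical helpers (not a Hive/Spark API) for inspecting the
# "parameters:{...}" blob printed by DESC EXTENDED.

def parse_table_parameters(blob: str) -> dict:
    """Parse 'parameters:{k1=v1, k2=v2, ...}' into a dict of strings."""
    inner = blob.split("{", 1)[1].rsplit("}", 1)[0]
    return dict(pair.split("=", 1) for pair in inner.split(", "))

def stats_look_broken(params: dict) -> bool:
    """-1 (or a totalSize of 0) is how 'unknown' shows up in the output above."""
    return (int(params.get("rawDataSize", "-1")) <= 0
            and int(params.get("totalSize", "-1")) <= 0)

# The two DESC EXTENDED outputs quoted in this thread:
broken = parse_table_parameters(
    "parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, "
    "COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}")
healthy = parse_table_parameters(
    "parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}")

assert stats_look_broken(broken)       # new build: no usable size estimate
assert not stats_look_broken(healthy)  # old build: totalSize=56166 is usable
```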
>>>
>>> In general, a couple of things you can try first (with HiveContext):
>>>
>>> 1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
>>> of the tables)
>>>
>>> 2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold
>>> to a greater value; it is 1024*1024*10 by default, so just make sure
>>> the largest dimension table's size in bytes is less than this)
>>>
>>> 3) Always put the main table (the biggest table) left-most among the
>>> inner joins.
>>>
>>> DESC EXTENDED tablename; -- this prints the detailed table information,
>>> including the table size statistic (the "totalSize" field).
>>>
>>> EXPLAIN EXTENDED query; -- this prints the detailed physical plan.
>>>
>>> Let me know if you still have problems.
>>>
>>> Hao
>>>
>>> From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>>> Sent: Thursday, November 27, 2014 10:24 PM
>>> To: Cheng, Hao
>>> Cc: user
>>> Subject: Re: Auto BroadcastJoin optimization failed in latest Spark
>>>
>>> Hi Hao,
>>>
>>> I'm using inner joins, as broadcast join didn't work for left joins
>>> (thanks for the links to the latest improvements).
>>>
>>> And I'm using HiveContext; it worked in a previous build (10/12) when
>>> joining 15 dimension tables.
>>>
>>> Jianshi
>>>
>>> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>
>>> Are all of your join keys the same? I guess the join types are all
>>> "left" joins; https://github.com/apache/spark/pull/3362 is probably what
>>> you need.
>>>
>>> Also, Spark SQL doesn't currently support multiway joins (or multiway
>>> broadcast joins); https://github.com/apache/spark/pull/3270 should be
>>> another optimization for this.
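Hao's point (2) is worth unpacking: the planner broadcasts a join side only when its estimated size is known and under spark.sql.autoBroadcastJoinThreshold (1024*1024*10 bytes, i.e. 10 MB, by default). Here is a simplified model of that decision, as I understand it; this is a sketch, not Spark's actual planner code.

```python
# Simplified, hypothetical model of the auto-broadcast decision; not Spark's
# actual planner code.

DEFAULT_AUTO_BROADCAST_THRESHOLD = 1024 * 1024 * 10  # the 10 MB default

def would_broadcast(total_size_bytes: int,
                    threshold: int = DEFAULT_AUTO_BROADCAST_THRESHOLD) -> bool:
    # A non-positive size models the missing-statistics case from this thread
    # (totalSize=0, rawDataSize=-1): with no usable estimate, no broadcast.
    return 0 < total_size_bytes <= threshold

assert would_broadcast(56166)                 # the old build's table qualifies
assert not would_broadcast(0)                 # no statistics, no broadcast
assert not would_broadcast(1024 * 1024 * 50)  # a 50 MB table is too big
```

The upshot for this thread: if a dimension table shows totalSize=0 or rawDataSize=-1 in DESC EXTENDED, raising the threshold will not help; the statistics themselves have to be fixed first.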
>>>
>>> From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>>> Sent: Wednesday, November 26, 2014 4:36 PM
>>> To: user
>>> Subject: Auto BroadcastJoin optimization failed in latest Spark
>>>
>>> Hi,
>>>
>>> I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1
>>> fails to optimize the auto broadcast join in my query. I have a query
>>> that joins a huge fact table with 15 tiny dimension tables.
>>>
>>> I'm currently using an older version of Spark, which was built on Oct. 12.
>>>
>>> Has anyone else met a similar situation?
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
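Regarding Hao's earlier tip about table ordering: for a star query like the one described in this thread (one huge fact table, 15 tiny dimensions), keeping the fact table left-most lets each small dimension side be the broadcast candidate. A hypothetical query builder illustrating that shape (table and column names are made up, not from this thread):

```python
# Hypothetical builder illustrating "big table left-most" for a star-schema
# inner join; names are illustrative only.

def build_star_join(fact: str, dims: list, key: str = "code") -> str:
    sql = "SELECT * FROM {}".format(fact)  # biggest table first
    for d in dims:
        sql += " JOIN {d} ON {f}.{k} = {d}.{k}".format(d=d, f=fact, k=key)
    return sql

query = build_star_join("fact_events", ["dim_a", "dim_b"])
# query is:
# "SELECT * FROM fact_events JOIN dim_a ON fact_events.code = dim_a.code
#  JOIN dim_b ON fact_events.code = dim_b.code"
```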