If I run ANALYZE without NOSCAN, then Hive can successfully get the size:

parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589,
COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}
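As a quick sanity check, the parameters:{...} blob that DESC EXTENDED prints can be parsed mechanically. This is a minimal sketch in Python; the helper functions below are mine for illustration, not a Hive or Spark API, and the validity rule (positive totalSize, non-negative numRows) is an assumption based on the values seen in this thread:

```python
def parse_table_parameters(blob):
    """Parse a Hive 'parameters:{k=v, ...}' blob into a dict (illustrative helper)."""
    inner = blob.split("{", 1)[1].rsplit("}", 1)[0]
    return dict(kv.split("=", 1) for kv in inner.split(", "))

def stats_look_valid(params):
    """Assumed rule: totalSize must be positive and numRows, if present, non-negative."""
    total_size = int(params.get("totalSize", "0"))
    num_rows = int(params.get("numRows", "0"))
    return total_size > 0 and num_rows >= 0

# The two outputs quoted later in this thread:
broken = parse_table_parameters(
    "parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, "
    "COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}")
working = parse_table_parameters(
    "parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}")

print(stats_look_valid(broken))   # False: totalSize=0 and numRows=-1
print(stats_look_valid(working))  # True: totalSize=56166
```

Running this against the two DESC EXTENDED outputs in the thread shows at a glance which build left the table without usable statistics.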
Is Hive's PARQUET support broken?

Jianshi

On Fri, Dec 5, 2014 at 3:30 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:

> Sorry for the late follow-up.
>
> I used Hao's DESC EXTENDED command and found some clues:
>
> new (broadcast-broken Spark build):
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
> COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
>
> old (broadcast-working Spark build):
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
> totalSize=56166}
>
> It looks like the table-size computation fails in the latest version.
>
> I've run the analyze command:
>
>   ANALYZE TABLE $table COMPUTE STATISTICS NOSCAN
>
> And the tables are created from Parquet files, e.g.:
>
>   CREATE EXTERNAL TABLE table1 (
>     code int,
>     desc string
>   )
>   STORED AS PARQUET
>   LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
>
> Does anyone know what went wrong?
>
> Thanks,
> Jianshi
>
> On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
>> Hi Jianshi,
>>
>> I couldn't reproduce that with the latest MASTER, and I always get a
>> BroadcastHashJoin for managed tables (in .csv files) in my testing. Are
>> there any external tables in your case?
>>
>> In general, a couple of things you can try first (with HiveContext):
>>
>> 1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
>> of the tables);
>>
>> 2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold
>> to a greater value; it is 1024*1024*10 bytes by default, so just make sure
>> the largest dimension table's size in bytes is less than this);
>>
>> 3) Always put the main table (the biggest table) left-most among the
>> inner joins.
>>
>> DESC EXTENDED tablename; -- prints the detailed table statistics
>> (the field "totalSize")
>>
>> EXPLAIN EXTENDED query; -- prints the detailed physical plan
>>
>> Let me know if you still have problems.
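The threshold rule Hao describes in point 2 can be sketched as follows. This is a minimal Python illustration, not Spark's actual code; in particular, the assumption that a table with missing or invalid stats (totalSize <= 0) is treated as effectively huge, and therefore never broadcast, is my reading of the behavior reported in this thread:

```python
# Default of spark.sql.autoBroadcastJoinThreshold, per Hao's message.
AUTO_BROADCAST_JOIN_THRESHOLD = 1024 * 1024 * 10

# Assumption: a table whose stats are missing/invalid is presumed huge,
# so it is never chosen for broadcast.
ASSUMED_SIZE_WHEN_STATS_MISSING = 2**63 - 1

def effective_table_size(total_size):
    """totalSize <= 0 (as in the broken build's output) counts as unknown."""
    if total_size is None or total_size <= 0:
        return ASSUMED_SIZE_WHEN_STATS_MISSING
    return total_size

def can_broadcast(total_size, threshold=AUTO_BROADCAST_JOIN_THRESHOLD):
    """Broadcast a table only if its known size fits under the threshold."""
    return effective_table_size(total_size) <= threshold

# With the old build's totalSize=56166, the dimension table fits under 10 MB
# and a BroadcastHashJoin is eligible; with the broken build's totalSize=0,
# the size is unknown, so the planner falls back to a shuffle join.
print(can_broadcast(56166))  # True
print(can_broadcast(0))      # False
```

This is consistent with the symptom above: the old build recorded totalSize=56166 and broadcast worked, while the new build recorded totalSize=0 / numRows=-1 and broadcast did not.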
>>
>> Hao
>>
>> *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>> *Sent:* Thursday, November 27, 2014 10:24 PM
>> *To:* Cheng, Hao
>> *Cc:* user
>> *Subject:* Re: Auto BroadcastJoin optimization failed in latest Spark
>>
>> Hi Hao,
>>
>> I'm using inner joins, as broadcast join didn't work for left joins
>> (thanks for the links to the latest improvements).
>>
>> And I'm using HiveContext, and it worked in a previous build (10/12)
>> when joining 15 dimension tables.
>>
>> Jianshi
>>
>> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>
>> Are all of your join keys the same? I guess the join types are all
>> "left" joins; https://github.com/apache/spark/pull/3362 is probably what
>> you need.
>>
>> Also, Spark SQL doesn't currently support multi-way join (or multi-way
>> broadcast join); https://github.com/apache/spark/pull/3270 should be
>> another optimization for this.
>>
>> *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>> *Sent:* Wednesday, November 26, 2014 4:36 PM
>> *To:* user
>> *Subject:* Auto BroadcastJoin optimization failed in latest Spark
>>
>> Hi,
>>
>> I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1
>> fails to optimize auto broadcast join in my query. I have a query that
>> joins a huge fact table with 15 tiny dimension tables.
>>
>> I'm currently using an older version of Spark, built on Oct. 12.
>>
>> Has anyone else met a similar situation?
--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/