With Liancheng's suggestion, I've tried setting
spark.sql.hive.convertMetastoreParquet to false, but ANALYZE with NOSCAN
still returns -1 for rawDataSize.

Jianshi

On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:

> If I run ANALYZE without NOSCAN, then Hive can successfully get the size:
>
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589,
> COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}
>
> Is Hive's PARQUET support broken?
>
> Jianshi
>
> On Fri, Dec 5, 2014 at 3:30 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
>> Sorry for the late follow-up.
>>
>> I used Hao's DESC EXTENDED command and found some clues.
>>
>> New (broadcast-broken Spark build):
>> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
>> COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
>>
>> Old (broadcast-working Spark build):
>> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
>> totalSize=56166}
>>
>> It looks like the table size computation fails in the latest version.
>>
>> I've run the analyze command:
>>
>> ANALYZE TABLE $table COMPUTE STATISTICS NOSCAN
>>
>> And the tables are created from Parquet files, e.g.:
>>
>> CREATE EXTERNAL TABLE table1 (
>>   code int,
>>   desc string
>> )
>> STORED AS PARQUET
>> LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
>>
>> Does anyone know what went wrong?
>>
>> Thanks,
>> Jianshi
>>
>> On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>
>>> Hi Jianshi,
>>>
>>> I couldn't reproduce that with the latest MASTER, and I always get the
>>> BroadcastHashJoin for managed tables (in .csv files) in my testing. Are
>>> there any external tables in your case?
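The two "parameters" blobs quoted above make the failure visible: the metastore statistics use -1 (and a totalSize of 0) as "unknown" sentinels. As a quick sanity check on many tables, here is a small sketch that parses that blob and flags broken statistics. The helpers and their names are hypothetical; only the field names come from the DESC EXTENDED output quoted above.

```python
# Hypothetical helpers (not a Hive/Spark API) for inspecting the
# "parameters:{...}" blob printed by DESC EXTENDED.

def parse_table_parameters(blob: str) -> dict:
    """Parse 'parameters:{k1=v1, k2=v2, ...}' into a dict of strings."""
    inner = blob.split("{", 1)[1].rsplit("}", 1)[0]
    return dict(pair.split("=", 1) for pair in inner.split(", "))

def stats_look_broken(params: dict) -> bool:
    """-1 (or a totalSize of 0) is how 'unknown' shows up in the output above."""
    return (int(params.get("rawDataSize", "-1")) <= 0
            and int(params.get("totalSize", "-1")) <= 0)

# The two DESC EXTENDED outputs quoted in this thread:
broken = parse_table_parameters(
    "parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, "
    "COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}")
healthy = parse_table_parameters(
    "parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}")

assert stats_look_broken(broken)       # new build: no usable size estimate
assert not stats_look_broken(healthy)  # old build: totalSize=56166 is usable
```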
>>>
>>> In general, a couple of things you can try first (with HiveContext):
>>>
>>> 1) ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all
>>> of the tables)
>>>
>>> 2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold
>>> to a greater value; it is 1024*1024*10 by default, so just make sure
>>> the largest dimension table's size in bytes is less than this)
>>>
>>> 3) Always put the main table (the biggest table) left-most among the
>>> inner joins.
>>>
>>> DESC EXTENDED tablename; -- this prints the detailed table information,
>>> including the table size statistic (the "totalSize" field).
>>>
>>> EXPLAIN EXTENDED query; -- this prints the detailed physical plan.
>>>
>>> Let me know if you still have problems.
>>>
>>> Hao
>>>
>>> From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>>> Sent: Thursday, November 27, 2014 10:24 PM
>>> To: Cheng, Hao
>>> Cc: user
>>> Subject: Re: Auto BroadcastJoin optimization failed in latest Spark
>>>
>>> Hi Hao,
>>>
>>> I'm using inner joins, as broadcast join didn't work for left joins
>>> (thanks for the links to the latest improvements).
>>>
>>> And I'm using HiveContext; it worked in a previous build (10/12) when
>>> joining 15 dimension tables.
>>>
>>> Jianshi
>>>
>>> On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>>
>>> Are all of your join keys the same? I guess the join types are all
>>> "left" joins; https://github.com/apache/spark/pull/3362 is probably what
>>> you need.
>>>
>>> Also, Spark SQL doesn't currently support multiway joins (or multiway
>>> broadcast joins); https://github.com/apache/spark/pull/3270 should be
>>> another optimization for this.
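Hao's point (2) is worth unpacking: the planner broadcasts a join side only when its estimated size is known and under spark.sql.autoBroadcastJoinThreshold (1024*1024*10 bytes, i.e. 10 MB, by default). Here is a simplified model of that decision, as I understand it; this is a sketch, not Spark's actual planner code.

```python
# Simplified, hypothetical model of the auto-broadcast decision; not Spark's
# actual planner code.

DEFAULT_AUTO_BROADCAST_THRESHOLD = 1024 * 1024 * 10  # the 10 MB default

def would_broadcast(total_size_bytes: int,
                    threshold: int = DEFAULT_AUTO_BROADCAST_THRESHOLD) -> bool:
    # A non-positive size models the missing-statistics case from this thread
    # (totalSize=0, rawDataSize=-1): with no usable estimate, no broadcast.
    return 0 < total_size_bytes <= threshold

assert would_broadcast(56166)                 # the old build's table qualifies
assert not would_broadcast(0)                 # no statistics, no broadcast
assert not would_broadcast(1024 * 1024 * 50)  # a 50 MB table is too big
```

The upshot for this thread: if a dimension table shows totalSize=0 or rawDataSize=-1 in DESC EXTENDED, raising the threshold will not help; the statistics themselves have to be fixed first.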
>>>
>>> From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>>> Sent: Wednesday, November 26, 2014 4:36 PM
>>> To: user
>>> Subject: Auto BroadcastJoin optimization failed in latest Spark
>>>
>>> Hi,
>>>
>>> I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1
>>> fails to optimize the auto broadcast join in my query. I have a query
>>> that joins a huge fact table with 15 tiny dimension tables.
>>>
>>> I'm currently using an older version of Spark, which was built on Oct. 12.
>>>
>>> Has anyone else met a similar situation?
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
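Regarding Hao's earlier tip about table ordering: for a star query like the one described in this thread (one huge fact table, 15 tiny dimensions), keeping the fact table left-most lets each small dimension side be the broadcast candidate. A hypothetical query builder illustrating that shape (table and column names are made up, not from this thread):

```python
# Hypothetical builder illustrating "big table left-most" for a star-schema
# inner join; names are illustrative only.

def build_star_join(fact: str, dims: list, key: str = "code") -> str:
    sql = "SELECT * FROM {}".format(fact)  # biggest table first
    for d in dims:
        sql += " JOIN {d} ON {f}.{k} = {d}.{k}".format(d=d, f=fact, k=key)
    return sql

query = build_star_join("fact_events", ["dim_a", "dim_b"])
# query is:
# "SELECT * FROM fact_events JOIN dim_a ON fact_events.code = dim_a.code
#  JOIN dim_b ON fact_events.code = dim_b.code"
```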