Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
Sorry for the late follow-up.

I used Hao's DESC EXTENDED command and found some clues:

new (broadcast broken Spark build):
parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892,
COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}

old (broadcast working Spark build):
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591,
totalSize=56166}

Looks like the table size computation failed in the latest version.
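As a side note, the difference between the two parameter maps is easy to check mechanically. Below is a small illustrative Python helper (not part of Spark or Hive; the parsing simply follows the shape of the dumps above, which may vary across Hive versions):

```python
import re

def parse_table_parameters(line):
    # Pull the body out of a `parameters:{k1=v1, k2=v2, ...}` dump as
    # printed by DESC EXTENDED, and turn it into a dict of strings.
    body = re.search(r"parameters:\{(.*)\}", line).group(1)
    return dict(kv.split("=", 1) for kv in body.split(", "))

new_build = parse_table_parameters(
    "parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, "
    "COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}")
old_build = parse_table_parameters(
    "parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}")

print(new_build["totalSize"], old_build["totalSize"])  # prints: 0 56166
```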

I've run the analyze command:

  ANALYZE TABLE $table COMPUTE STATISTICS noscan

And the tables are created from Parquet files:

e.g.
CREATE EXTERNAL TABLE table1 (
  code int,
  desc string
)
STORED AS PARQUET
LOCATION '/user/jianshuang/data/dim_tables/table1.parquet'
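The statistic that matters for the broadcast decision is totalSize. Here is a rough sketch of the auto-broadcast check, consistent with the behavior reported in this thread; it is an illustration only, not Spark's actual planner code:

```python
DEFAULT_THRESHOLD = 10 * 1024 * 1024  # spark.sql.autoBroadcastJoinThreshold default

def can_auto_broadcast(total_size_bytes, threshold=DEFAULT_THRESHOLD):
    # A table qualifies for broadcast only when its estimated size is both
    # known (> 0) and under the threshold. With totalSize=0 the estimate is
    # effectively unusable, so the planner falls back to a shuffle join --
    # matching the symptom described above.
    return 0 < total_size_bytes <= threshold

print(can_auto_broadcast(56166))  # True  (old build's stats)
print(can_auto_broadcast(0))      # False (new build's stats)
```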


Does anyone know what went wrong?


Thanks,
Jianshi




-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
If I run ANALYZE without NOSCAN, then Hive can successfully get the size:

parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589,
COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296}

Is Hive's PARQUET support broken?

Jianshi






-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
With Liancheng's suggestion, I've tried setting

  spark.sql.hive.convertMetastoreParquet=false

but ANALYZE with NOSCAN still returns -1 for rawDataSize.
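numFiles=0 and totalSize=0 suggest the file-enumeration step behind the NOSCAN statistics found nothing under the table's location. Conceptually, that step does little more than the following (illustrative Python, not Hive's implementation):

```python
import os

def noscan_stats(table_location):
    # What an ANALYZE ... NOSCAN pass conceptually gathers: file count and
    # total bytes under the table's location, without reading row contents.
    # numFiles=0 / totalSize=0 in the DESC EXTENDED output means this
    # enumeration found nothing (or was skipped) for the Parquet table.
    num_files, total_size = 0, 0
    for root, _dirs, files in os.walk(table_location):
        for name in files:
            num_files += 1
            total_size += os.path.getsize(os.path.join(root, name))
    return {"numFiles": num_files, "totalSize": total_size}
```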

Jianshi






-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Jianshi Huang
Hi Hao,

I'm using inner joins, as broadcast join didn't work for left joins (thanks
for the links to the latest improvements).

And I'm using HiveContext; it worked in a previous build (10/12) when
joining 15 dimension tables.

Jianshi





-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


RE: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Cheng, Hao
Hi Jianshi,
I couldn’t reproduce that with the latest MASTER; I always get the 
BroadcastHashJoin for managed tables (in .csv files) in my testing. Are 
there any external tables in your case?

In general, here are a couple of things you can try first (with HiveContext):

1)  ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all of the 
tables);

2)  SET spark.sql.autoBroadcastJoinThreshold=xxx; (Set the threshold to a 
greater value; it is 1024*1024*10 by default. Just make sure the maximum 
dimension table size (in bytes) is less than this.)

3)  Always put the main table (the biggest table) left-most among the 
inner joins;
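Steps 1) and 2) above are easy to script. The sketch below just builds the SQL strings to feed to a HiveContext; the table names and the threshold value are examples, not taken from this thread:

```python
def build_prep_statements(tables, threshold_bytes=10 * 1024 * 1024):
    # Emit one ANALYZE per table (step 1), then the threshold setting
    # (step 2). These are plain SQL strings for a HiveContext to run.
    stmts = [f"ANALYZE TABLE {t} COMPUTE STATISTICS NOSCAN;" for t in tables]
    stmts.append(f"SET spark.sql.autoBroadcastJoinThreshold={threshold_bytes};")
    return stmts

for s in build_prep_statements(["dim_country", "dim_currency"]):
    print(s)
```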

DESC EXTENDED tablename; -- this will print the detailed table information, 
including the size statistics (the field “totalSize”)
EXPLAIN EXTENDED query; -- this will print the detailed physical plan.

Let me know if you still have problems.

Hao



RE: Auto BroadcastJoin optimization failed in latest Spark

2014-11-26 Thread Cheng, Hao
Are all of your join keys the same? And I guess the join types are all “Left” 
joins; https://github.com/apache/spark/pull/3362 is probably what you need.

And SparkSQL doesn’t support multiway joins (and multiway broadcast joins) 
currently; https://github.com/apache/spark/pull/3270 should be another 
optimization for this.
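For intuition, a broadcast join against several tiny dimension tables amounts to shipping each small table to every task as an in-memory hash map and enriching fact rows by lookup. A toy pure-Python picture with made-up data (not Spark API):

```python
# Fact rows reference two dimension tables; each dimension is small enough
# to be "broadcast" as a plain dict, so the join becomes a hash lookup.
fact = [{"code": 1, "cur": "USD", "amount": 10},
        {"code": 2, "cur": "EUR", "amount": 20}]
dim_code = {1: "one", 2: "two"}                    # broadcast side 1
dim_cur = {"USD": "US Dollar", "EUR": "Euro"}      # broadcast side 2

joined = [dict(r, code_desc=dim_code[r["code"]], cur_desc=dim_cur[r["cur"]])
          for r in fact]
print(joined[0]["code_desc"])  # prints: one
```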


From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Wednesday, November 26, 2014 4:36 PM
To: user
Subject: Auto BroadcastJoin optimization failed in latest Spark

Hi,

I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1 fails 
to optimize the auto broadcast join in my query. I have a query that joins a 
huge fact table with 15 tiny dimension tables.

I'm currently using an older version of Spark which was built on Oct. 12.

Has anyone else met a similar situation?

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/