Hi Michael,

Thanks for the feedback.


I am using version 1.5.2 now.


Can you tell me how to force a broadcast join? I don't want to let the engine 
decide the join execution path; I want to use a hint or a parameter to force a 
broadcast join (I also have some cases that are inner joins where I want a 
broadcast join).

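For example, something like this is what I am after (a rough PySpark sketch; I 
believe functions.broadcast only exists in newer releases, and the paths and 
join column here are just placeholders):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import broadcast  # newer releases only, I believe

    sc = SparkContext(appName="broadcast-join-example")
    sqlContext = SQLContext(sc)

    big = sqlContext.read.parquet("/path/to/big.parquet")      # placeholder paths
    small = sqlContext.read.parquet("/path/to/small.parquet")

    # Mark the small side so the planner chooses a broadcast join,
    # regardless of spark.sql.autoBroadcastJoinThreshold.
    joined = big.join(broadcast(small), "id", "left_outer")

Something equivalent from Java (org.apache.spark.sql.functions.broadcast?) 
would work for me too.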

Or is there any ticket or roadmap for this feature?


Regards,


Shuai


From: Michael Armbrust [mailto:mich...@databricks.com] 
Sent: Saturday, December 05, 2015 4:11 PM
To: Shuai Zheng
Cc: Jitesh chandra Mishra; user
Subject: Re: Broadcasting a parquet file using spark and python


I believe we started supporting broadcast outer joins in Spark 1.5.  Which 
version are you using? 


On Fri, Dec 4, 2015 at 2:49 PM, Shuai Zheng <szheng.c...@gmail.com> wrote:

Hi all,


Sorry to re-open this thread.


I have a similar issue: one big parquet file left outer joined with quite a few 
smaller parquet files. But the run is extremely slow and sometimes even OOMs 
(with 300M ...). I have two questions here:


1. If I use an outer join, will Spark SQL automatically use a broadcast hash join?

2. If not, per the latest documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html

spark.sql.autoBroadcastJoinThreshold (default: 10485760, i.e. 10 MB)
Configures the maximum size in bytes for a table that will be broadcast to all 
worker nodes when performing a join. By setting this value to -1, broadcasting 
can be disabled. Note that currently statistics are only supported for Hive 
Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS 
noscan has been run.


How can I do this (run the ANALYZE TABLE command) in Java? I know I can code it 
myself (create a broadcast val and implement the lookup by hand), but that 
would make the code super ugly.

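To be concrete, this is the kind of thing I mean, sketched in PySpark (the 
table name is a placeholder and it assumes a Hive metastore; I understand the 
same sql(...) call exists on HiveContext in Java):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="analyze-table-example")
    sqlContext = HiveContext(sc)

    # "small_table" is a hypothetical Hive metastore table.
    sqlContext.sql("ANALYZE TABLE small_table COMPUTE STATISTICS noscan")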

I hope we can have either an API or a hint to force the broadcast hash join 
(instead of relying on this indirect autoBroadcastJoinThreshold parameter). Is 
there any ticket or roadmap for this feature?


Regards,


Shuai


From: Michael Armbrust [mailto:mich...@databricks.com] 
Sent: Wednesday, April 01, 2015 2:01 PM
To: Jitesh chandra Mishra
Cc: user
Subject: Re: Broadcasting a parquet file using spark and python


You will need to create a Hive parquet table that points to the data and run 
"ANALYZE TABLE tableName COMPUTE STATISTICS noscan" so that we have statistics 
on the size.

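A minimal sketch of that in PySpark (schema, location, and table name are 
placeholders; assumes a HiveContext backed by a Hive metastore):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hive-parquet-table")
    sqlContext = HiveContext(sc)

    # Point an external Hive table at the existing parquet data.
    sqlContext.sql("""
        CREATE EXTERNAL TABLE small_table (id INT, name STRING)
        STORED AS PARQUET
        LOCATION '/path/to/small.parquet'
    """)

    # Gather size statistics so the planner can consider broadcasting.
    sqlContext.sql("ANALYZE TABLE small_table COMPUTE STATISTICS noscan")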

On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra <jitesh...@gmail.com> 
wrote:

Hi Michael,


Thanks for your response. I am running 1.2.1. 


Is there any workaround to achieve the same with 1.2.1?


Thanks,

Jitesh


On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust <mich...@databricks.com> 
wrote:

In Spark 1.3 I would expect this to happen automatically when the parquet table 
is small (< 10 MB, configurable with spark.sql.autoBroadcastJoinThreshold). If 
you are running 1.3 and not seeing this, can you show the code you are using to 
create the table?

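If the table is over the default 10 MB, you can also raise the threshold, e.g. 
via setConf (the value here is just an example):

    # ~50 MB threshold (example value); setting it to -1 disables
    # automatic broadcast joins entirely.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "52428800")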

On Tue, Mar 31, 2015 at 3:25 AM, jitesh129 <jitesh...@gmail.com> wrote:

How can we implement a BroadcastHashJoin for spark with python?

My Spark SQL inner joins are taking a lot of time because they are performed as
a ShuffledHashJoin.

The tables being joined are stored as parquet files.

Please help.

Thanks and regards,
Jitesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Broadcasting-a-parquet-file-using-spark-and-python-tp22315.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
