Hi Michael,
Thanks for the feedback. I am using version 1.5.2 now. Can you tell me how to enforce a broadcast join? I don't want to let the engine decide the execution path of the join; I want to use a hint or parameter to force a broadcast join (I also have some cases that are inner joins where I want a broadcast join). Or is there any ticket or roadmap for this feature?

Regards,
Shuai

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Saturday, December 05, 2015 4:11 PM
To: Shuai Zheng
Cc: Jitesh chandra Mishra; user
Subject: Re: Broadcasting a parquet file using spark and python

I believe we started supporting broadcast outer joins in Spark 1.5. Which version are you using?

On Fri, Dec 4, 2015 at 2:49 PM, Shuai Zheng <szheng.c...@gmail.com> wrote:

Hi all,

Sorry to re-open this thread. I have a similar issue: one big parquet file left outer joined against quite a few smaller parquet files, but the run is extremely slow and even OOMs sometimes (with 300M …). I have two questions here:

1. If I use an outer join, will Spark SQL automatically use a broadcast hash join?

2. If not, the latest documentation (http://spark.apache.org/docs/latest/sql-programming-guide.html) says:

spark.sql.autoBroadcastJoinThreshold    10485760 (10 MB)
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.

How can I run this ANALYZE TABLE command from Java?

I know I can code it myself (create a broadcast variable and implement the lookup by hand), but that would make the code super ugly. I hope we can have either an API or a hint to force the hash join (instead of relying on the opaque autoBroadcastJoinThreshold parameter). Do we have any ticket or roadmap for this feature?
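[For readers of the archive: the manual workaround mentioned above (turn the small table into an in-memory hash map and look the big table's rows up against it, instead of shuffling both sides) can be sketched in plain Python. The function and field names here are illustrative only, not Spark API:]

```python
# Plain-Python sketch of what a broadcast hash join does: the small side
# becomes an in-memory lookup table shipped to every worker, so the big
# side is joined with no shuffle. Names are illustrative, not Spark API.

def broadcast_hash_join(big_rows, small_rows, key, how="left_outer"):
    # Build the hash table from the small side once.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Stream the big side against the lookup table.
    joined = []
    for row in big_rows:
        matches = lookup.get(row[key])
        if matches:
            for m in matches:
                merged = dict(m)
                merged.update(row)  # big side wins on column clashes
                joined.append(merged)
        elif how == "left_outer":
            joined.append(dict(row))  # keep unmatched left-side rows
    return joined

big = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
small = [{"k": 1, "b": "p"}]
print(broadcast_hash_join(big, small, "k"))
```

[Later Spark releases did add the requested hint: `functions.broadcast(df)` in the DataFrame API, and `/*+ BROADCAST(t) */` style SQL hints from Spark 2.2, which answer the "enforce a broadcast join" question without hand-rolled lookups.]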
Regards,
Shuai

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, April 01, 2015 2:01 PM
To: Jitesh chandra Mishra
Cc: user
Subject: Re: Broadcasting a parquet file using spark and python

You will need to create a hive parquet table that points to the data and run "ANALYZE TABLE tableName noscan" so that we have statistics on the size.

On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra <jitesh...@gmail.com> wrote:

Hi Michael,

Thanks for your response. I am running 1.2.1. Is there any workaround to achieve the same with 1.2.1?

Thanks,
Jitesh

On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust <mich...@databricks.com> wrote:

In Spark 1.3 I would expect this to happen automatically when the parquet table is small (< 10mb, configurable with spark.sql.autoBroadcastJoinThreshold). If you are running 1.3 and not seeing this, can you show the code you are using to create the table?

On Tue, Mar 31, 2015 at 3:25 AM, jitesh129 <jitesh...@gmail.com> wrote:

How can we implement a BroadcastHashJoin for Spark with Python?

My SparkSQL inner joins are taking a lot of time since they perform a ShuffledHashJoin. The tables being joined are stored as parquet files.

Please help.

Thanks and regards,
Jitesh
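[For readers of the archive: Michael's advice to run ANALYZE TABLE, and Shuai's question about doing it from Java, both reduce to submitting the statement as a plain SQL string. A configuration sketch, assuming a Hive-backed context and Spark 1.5-era APIs; the table names are hypothetical and this only runs inside a live Spark application:]

```python
# Sketch only: requires a live SparkContext `sc` and a Hive metastore;
# table names (big_table, small_table) are hypothetical.
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

# Optionally raise the auto-broadcast threshold (in bytes; default 10 MB).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

# Compute size statistics for the small table so the planner can pick a
# broadcast join when the table falls under the threshold.
sqlContext.sql("ANALYZE TABLE small_table COMPUTE STATISTICS noscan")

df = sqlContext.sql(
    "SELECT * FROM big_table b LEFT OUTER JOIN small_table s ON b.k = s.k")
```

[The same approach works from Java via sqlContext.sql("ANALYZE TABLE ..."), since the statement is plain SQL rather than a language-specific API.]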