[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348225#comment-14348225 ]
Yin Huai commented on SPARK-5791:
---------------------------------

I see. In Hive's plan, item, warehouse, and date_dim are all broadcast tables. In Spark SQL's plan, however, the join between item and inventory was a shuffle join. Can you set spark.sql.autoBroadcastJoinThreshold to a value larger than the size of item? Also, what is the value of spark.serializer? Setting spark.serializer to org.apache.spark.serializer.KryoSerializer will also help performance (Kryo will then be used to serialize the broadcast tables).

> [Spark SQL] show poor performance when multiple table do join operation
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5791
>                 URL: https://issues.apache.org/jira/browse/SPARK-5791
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Yi Zhou
>        Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt
>
> Spark SQL shows poor performance when multiple tables do a join operation

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
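The two settings suggested in the comment could be applied as in the sketch below. This is illustrative only: the 100 MB threshold and the application name are assumptions, and the threshold should be set to whatever exceeds the actual size of the item table (the default in Spark 1.2 is 10 MB, i.e. 10485760 bytes).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Use Kryo so broadcast tables are serialized with Kryo rather than
// Java serialization (smaller payloads, faster serialization).
val conf = new SparkConf()
  .setAppName("BroadcastJoinTuning") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Raise the broadcast threshold above the size of the item table so the
// item-inventory join is planned as a broadcast join instead of a shuffle
// join. 100 MB (104857600 bytes) here is an illustrative value.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "104857600")
```

Equivalently, the threshold can be set per-session with `sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")`, or both properties can be passed as `--conf` flags to spark-submit.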