[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348225#comment-14348225 ]
Yin Huai commented on SPARK-5791:
---------------------------------

I see. In Hive's plan, item, warehouse, and date_dim are all broadcast tables. In Spark SQL's plan, however, the join between item and inventory was a shuffle join. Can you set spark.sql.autoBroadcastJoinThreshold to a value larger than the size of item? Also, what is the value of spark.serializer? Setting spark.serializer to org.apache.spark.serializer.KryoSerializer will also help performance (Kryo will then be used to serialize the broadcast tables).

> [Spark SQL] show poor performance when multiple table do join operation
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5791
>                 URL: https://issues.apache.org/jira/browse/SPARK-5791
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Yi Zhou
>        Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt
>
> Spark SQL shows poor performance when multiple tables do a join operation

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
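The two settings suggested in the comment could be applied as in the sketch below. This is illustrative only: the 100 MB threshold and the application name are assumptions, and the threshold should be set to whatever exceeds the actual size of the item table (the default in Spark 1.2 is 10 MB, i.e. 10485760 bytes).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Use Kryo so broadcast tables are serialized with Kryo rather than
// Java serialization (smaller payloads, faster serialization).
val conf = new SparkConf()
  .setAppName("BroadcastJoinTuning") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Raise the broadcast threshold above the size of the item table so the
// item-inventory join is planned as a broadcast join instead of a shuffle
// join. 100 MB (104857600 bytes) here is an illustrative value.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "104857600")
```

Equivalently, the threshold can be set per-session with `sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")`, or both properties can be passed as `--conf` flags to spark-submit.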