[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

Yi Zhou (JIRA) Wed, 04 Mar 2015 23:41:06 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348324#comment-14348324
 ]


Yi Zhou commented on SPARK-5791:
--------------------------------

Thank you [~yhuai]. Performance improved greatly with the parameters below, and 
the updated Spark SQL physical plan follows. In the latest test results, 
however, the query is still slow compared with Hive on M/R (~6 min vs. ~2 min).
spark.sql.shuffle.partitions=200;
spark.sql.autoBroadcastJoinThreshold=209715200;
spark.serializer=org.apache.spark.serializer.KryoSerializer
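
As a quick sanity check on the values above: spark.sql.autoBroadcastJoinThreshold is expressed in bytes, so 209715200 corresponds to 200 MiB. The sketch below only reproduces that arithmetic and prints the settings as key=value lines; the helper name `mib_to_bytes` and the `settings` dictionary are hypothetical and not part of Spark.

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return mib * 1024 * 1024


# Hypothetical container for the three settings quoted in the comment.
settings = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.autoBroadcastJoinThreshold": str(mib_to_bytes(200)),
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def render(conf: dict) -> list:
    """Render each setting as a key=value line."""
    return [f"{key}={value}" for key, value in conf.items()]


if __name__ == "__main__":
    for line in render(settings):
        print(line)
```

With a threshold this large, any table whose estimated size is under 200 MiB becomes a candidate for broadcasting, which would be consistent with the three BroadcastHashJoin operators in the plan.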

== Physical Plan ==
InsertIntoHiveTable (MetastoreRelation bigbenchorc, q22_spark_run_query_0_result, None), Map(), false
 Sort [w_warehouse_name#674 ASC,i_item_id#651 ASC], false
  Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
   Filter (((inv_before#635L > 0) && ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) >= 0.6666666666666666)) && ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) <= 1.5))
    Aggregate false, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(PartialSum#716L) AS inv_before#635L,SUM(PartialSum#717L) AS inv_after#636L]
     Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
      Aggregate true, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) < 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#716L,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) >= 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#717L]
       Project [w_warehouse_name#674,i_item_id#651,d_date#688,inv_quantity_on_hand#649]
        BroadcastHashJoin [inv_date_sk#646L], [d_date_sk#686L], BuildRight
         Project [i_item_id#651,w_warehouse_name#674,inv_date_sk#646L,inv_quantity_on_hand#649]
          BroadcastHashJoin [inv_warehouse_sk#648L], [w_warehouse_sk#672L], BuildRight
           Project [inv_warehouse_sk#648L,i_item_id#651,inv_date_sk#646L,inv_quantity_on_hand#649]
            BroadcastHashJoin [inv_item_sk#647L], [i_item_sk#650L], BuildRight
             HiveTableScan [inv_date_sk#646L,inv_item_sk#647L,inv_warehouse_sk#648L,inv_quantity_on_hand#649], (MetastoreRelation bigbenchorc, inventory, Some(inv)), None
             Project [i_item_id#651,i_item_sk#650L]
              Filter ((i_current_price#655 > 0.98) && (i_current_price#655 < 1.5))
               HiveTableScan [i_item_id#651,i_item_sk#650L,i_current_price#655], (MetastoreRelation bigbenchorc, item, None), None
           HiveTableScan [w_warehouse_name#674,w_warehouse_sk#672L], (MetastoreRelation bigbenchorc, warehouse, Some(w)), None
         Filter ((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) >= -30) && (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) <= 30))
          HiveTableScan [d_date_sk#686L,d_date#688], (MetastoreRelation bigbenchorc, date_dim, Some(d)), None
Time taken: 2.579 seconds
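
To make the plan easier to read, here is a rough single-process sketch of what it appears to compute: three hash joins against small, pre-filtered build sides, a conditional sum split at 2001-05-08, and a ratio filter on the result. This is only an illustration of the plan's logic, not how Spark executes it. The function names (`datediff`, `run_query`), the in-memory tuple layout, and the assumption that the UDF follows Hive's datediff(enddate, startdate) argument order are all mine.

```python
from collections import defaultdict
from datetime import date

PIVOT = date(2001, 5, 8)  # date literal taken from the plan


def datediff(end: date, start: date) -> int:
    """Days from start to end (assumed to mirror Hive's datediff)."""
    return (end - start).days


def run_query(inventory, items, warehouses, dates):
    """inventory: (date_sk, item_sk, warehouse_sk, quantity_on_hand)
    items: (item_sk, item_id, current_price)
    warehouses: (warehouse_sk, warehouse_name)
    dates: (date_sk, d_date)
    """
    # Build sides of the three joins, filtered as in the plan and held
    # as in-memory hash maps (the role the broadcast tables play).
    item_ids = {sk: iid for sk, iid, price in items if 0.98 < price < 1.5}
    wh_names = {sk: name for sk, name in warehouses}
    in_window = {sk: d for sk, d in dates if -30 <= datediff(d, PIVOT) <= 30}

    # (warehouse_name, item_id) -> [inv_before, inv_after]
    sums = defaultdict(lambda: [0, 0])
    for date_sk, item_sk, wh_sk, qty in inventory:
        # Inner joins: rows without a match on every build side are dropped.
        if (item_sk not in item_ids or wh_sk not in wh_names
                or date_sk not in in_window):
            continue
        key = (wh_names[wh_sk], item_ids[item_sk])
        # CASE WHEN datediff < 0 THEN qty ELSE 0, and the mirror case.
        if datediff(in_window[date_sk], PIVOT) < 0:
            sums[key][0] += qty
        else:
            sums[key][1] += qty

    # Final Filter and Sort operators from the plan.
    rows = [
        (wh, item, before, after)
        for (wh, item), (before, after) in sums.items()
        if before > 0 and 0.6666666666666666 <= after / before <= 1.5
    ]
    return sorted(rows)
```

In this reading, the inventory scan is the only large input; item, warehouse and date_dim are all reduced to lookup tables before the inventory rows are streamed through them.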


> [Spark SQL] show poor performance when multiple table do join operation
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5791
>                 URL: https://issues.apache.org/jira/browse/SPARK-5791
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Yi Zhou
>         Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt
>
>
> Spark SQL shows poor performance when joining multiple tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
