[
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348324#comment-14348324
]
Yi Zhou commented on SPARK-5791:
--------------------------------
Thank you [~yhuai]. With the parameters below, the Spark SQL physical plan
changed and performance improved greatly. However, in the latest test results
the query is still slow compared with Hive on M/R (~6 min vs ~2 min).
spark.sql.shuffle.partitions=200;
spark.sql.autoBroadcastJoinThreshold=209715200;
spark.serializer=org.apache.spark.serializer.KryoSerializer
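Note that the broadcast threshold above is 200 MiB expressed in bytes (200 * 1024 * 1024 = 209715200). As a minimal sketch only, assuming the three settings are passed to the launcher as `--conf key=value` arguments rather than set inside the session, they could be generated like this. The names `SETTINGS` and `to_conf_args` are hypothetical helpers for illustration and are not part of Spark.

```python
# Sketch: build "--conf key=value" launcher arguments for the settings above.
# SETTINGS and to_conf_args are illustrative names, not Spark APIs.

MIB = 1024 * 1024  # bytes per mebibyte

SETTINGS = {
    "spark.sql.shuffle.partitions": 200,
    # 200 MiB in bytes; tables estimated below this size may be broadcast.
    "spark.sql.autoBroadcastJoinThreshold": 200 * MIB,
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def to_conf_args(settings):
    """Flatten a settings dict into ['--conf', 'k=v', ...] in insertion order."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the arguments on one line so they can be pasted after the launcher name.
    print(" ".join(to_conf_args(SETTINGS)))
```

The resulting arguments would be appended to whatever launcher command is in use; the exact command line used for the test above is not stated here.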
== Physical Plan ==
InsertIntoHiveTable (MetastoreRelation bigbenchorc, q22_spark_run_query_0_result, None), Map(), false
 Sort [w_warehouse_name#674 ASC,i_item_id#651 ASC], false
  Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
   Filter (((inv_before#635L > 0) && ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) >= 0.6666666666666666)) && ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) <= 1.5))
    Aggregate false, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(PartialSum#716L) AS inv_before#635L,SUM(PartialSum#717L) AS inv_after#636L]
     Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
      Aggregate true, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) < 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#716L,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) >= 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#717L]
       Project [w_warehouse_name#674,i_item_id#651,d_date#688,inv_quantity_on_hand#649]
        BroadcastHashJoin [inv_date_sk#646L], [d_date_sk#686L], BuildRight
         Project [i_item_id#651,w_warehouse_name#674,inv_date_sk#646L,inv_quantity_on_hand#649]
          BroadcastHashJoin [inv_warehouse_sk#648L], [w_warehouse_sk#672L], BuildRight
           Project [inv_warehouse_sk#648L,i_item_id#651,inv_date_sk#646L,inv_quantity_on_hand#649]
            BroadcastHashJoin [inv_item_sk#647L], [i_item_sk#650L], BuildRight
             HiveTableScan [inv_date_sk#646L,inv_item_sk#647L,inv_warehouse_sk#648L,inv_quantity_on_hand#649], (MetastoreRelation bigbenchorc, inventory, Some(inv)), None
             Project [i_item_id#651,i_item_sk#650L]
              Filter ((i_current_price#655 > 0.98) && (i_current_price#655 < 1.5))
               HiveTableScan [i_item_id#651,i_item_sk#650L,i_current_price#655], (MetastoreRelation bigbenchorc, item, None), None
           HiveTableScan [w_warehouse_name#674,w_warehouse_sk#672L], (MetastoreRelation bigbenchorc, warehouse, Some(w)), None
         Filter ((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) >= -30) && (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) <= 30))
          HiveTableScan [d_date_sk#686L,d_date#688], (MetastoreRelation bigbenchorc, date_dim, Some(d)), None
Time taken: 2.579 seconds
> [Spark SQL] show poor performance when multiple table do join operation
> -----------------------------------------------------------------------
>
> Key: SPARK-5791
> URL: https://issues.apache.org/jira/browse/SPARK-5791
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0
> Reporter: Yi Zhou
> Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt
>
>
> Spark SQL shows poor performance when multiple tables do join operation