[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348324#comment-14348324 ]
Yi Zhou edited comment on SPARK-5791 at 3/5/15 7:40 AM:
--------------------------------------------------------

Thank you [~yhuai]. I updated the Spark SQL physical plan with the parameters below, which greatly improved performance. But the latest test results show the query is still slow compared with Hive on M/R (~6 min vs ~2 min).

spark.sql.shuffle.partitions=200;
spark.sql.autoBroadcastJoinThreshold=209715200;
spark.serializer=org.apache.spark.serializer.KryoSerializer

was (Author: jameszhouyi):

Thank you [~yhuai]. I updated the Spark SQL physical plan with the parameters below, which greatly improved performance. But the latest test results show the query is still slow compared with Hive on M/R (~6 min vs ~2 min).

spark.sql.shuffle.partitions=200;
spark.sql.autoBroadcastJoinThreshold=209715200;
spark.serializer=org.apache.spark.serializer.KryoSerializer

== Physical Plan ==
InsertIntoHiveTable (MetastoreRelation bigbenchorc, q22_spark_run_query_0_result, None), Map(), false
 Sort [w_warehouse_name#674 ASC,i_item_id#651 ASC], false
  Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
   Filter (((inv_before#635L > 0) && ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) >= 0.6666666666666666)) && ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) <= 1.5))
    Aggregate false, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(PartialSum#716L) AS inv_before#635L,SUM(PartialSum#717L) AS inv_after#636L]
     Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
      Aggregate true, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) < 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#716L,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) >= 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#717L]
       Project [w_warehouse_name#674,i_item_id#651,d_date#688,inv_quantity_on_hand#649]
        BroadcastHashJoin [inv_date_sk#646L], [d_date_sk#686L], BuildRight
         Project [i_item_id#651,w_warehouse_name#674,inv_date_sk#646L,inv_quantity_on_hand#649]
          BroadcastHashJoin [inv_warehouse_sk#648L], [w_warehouse_sk#672L], BuildRight
           Project [inv_warehouse_sk#648L,i_item_id#651,inv_date_sk#646L,inv_quantity_on_hand#649]
            BroadcastHashJoin [inv_item_sk#647L], [i_item_sk#650L], BuildRight
             HiveTableScan [inv_date_sk#646L,inv_item_sk#647L,inv_warehouse_sk#648L,inv_quantity_on_hand#649], (MetastoreRelation bigbenchorc, inventory, Some(inv)), None
             Project [i_item_id#651,i_item_sk#650L]
              Filter ((i_current_price#655 > 0.98) && (i_current_price#655 < 1.5))
               HiveTableScan [i_item_id#651,i_item_sk#650L,i_current_price#655], (MetastoreRelation bigbenchorc, item, None), None
           HiveTableScan [w_warehouse_name#674,w_warehouse_sk#672L], (MetastoreRelation bigbenchorc, warehouse, Some(w)), None
         Filter ((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) >= -30) && (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) <= 30))
          HiveTableScan [d_date_sk#686L,d_date#688], (MetastoreRelation bigbenchorc, date_dim, Some(d)), None
Time taken: 2.579 seconds

> [Spark SQL] show poor performance when multiple table do join operation
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5791
>                 URL: https://issues.apache.org/jira/browse/SPARK-5791
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Yi Zhou
>        Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt
>
> Spark SQL shows poor performance when multiple tables do join operation

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
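
[Editor's note, not from the ticket] As a sketch of how the three settings quoted in the comment are typically applied in a Spark 1.2-era deployment: the two spark.sql.* settings can be changed per-session with SET from the Spark SQL CLI or a HiveContext, but spark.serializer is a core setting that must be in place before the SparkContext starts, e.g. via conf/spark-defaults.conf:

```
# conf/spark-defaults.conf (sketch; values taken from the comment above)
spark.sql.shuffle.partitions           200
# 209715200 bytes = 200 MB; tables smaller than this are broadcast-joined
spark.sql.autoBroadcastJoinThreshold   209715200
spark.serializer                       org.apache.spark.serializer.KryoSerializer
```

Raising the broadcast threshold is what turns the joins against the item, warehouse, and date_dim dimension tables into the BroadcastHashJoin operators visible in the plan, avoiding a shuffle of the large inventory fact table.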