[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734721#comment-14734721 ]

Yi Zhou commented on SPARK-5791:

[~yhuai], Yes. Thank you!

> [Spark SQL] show poor performance when multiple table do join operation
> ---
>
> Key: SPARK-5791
> URL: https://issues.apache.org/jira/browse/SPARK-5791
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0
> Reporter: Yi Zhou
> Attachments: Physcial_Plan_Hive.txt, Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt
>
> Spark SQL shows poor performance when multiple tables do join operation

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644689#comment-14644689 ]

Yin Huai commented on SPARK-5791:

[~jameszhouyi] So, has the join performance issue in your test been resolved?
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493350#comment-14493350 ]

Yi Zhou commented on SPARK-5791:

[~yhuai], yes, both used Parquet.
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492618#comment-14492618 ]

Yin Huai commented on SPARK-5791:

[~jameszhouyi] Thank you for the update :) Hive also used Parquet in your last run, right?
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491864#comment-14491864 ]

Yi Zhou commented on SPARK-5791:

We changed the file format from ORC to Parquet and got the following result: Spark SQL (2m28s) vs. Hive (3m12s).
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350396#comment-14350396 ]

Yi Zhou commented on SPARK-5791:

The result of the 'name' subquery is about 3.7 MB in size.
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349139#comment-14349139 ]

Yin Huai commented on SPARK-5791:

[~jameszhouyi] Thank you for the updated physical plan. What file format is used for those tables, ORC or Parquet? Also, which version of Spark are you using? If Parquet is used, HiveTableScan is not as efficient as our native Parquet support (ParquetRelation2 in Spark SQL). Actually, if you are using Spark 1.3 and the data is stored as Parquet, you should not see HiveTableScan when reading Parquet data.
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349183#comment-14349183 ]

Yin Huai commented on SPARK-5791:

Also, how large is the result of the name subquery?
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349929#comment-14349929 ]

Yi Zhou commented on SPARK-5791:

[~yhuai] Currently all of the input tables are in ORC file format. We used CDH5.3.0 (Spark 1.2) when testing this query.
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348293#comment-14348293 ]

Cheng Hao commented on SPARK-5791:

I think this is a typical case where we need to optimize joins against dimension tables, because much of their data is filtered out by the join condition. In this case, it is possible that most of the data is filtered out by the following condition:

{panel}
JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk
WHERE datediff(d_date, '2001-05-08') >= -30
  AND datediff(d_date, '2001-05-08') <= 30
{/panel}
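To make the selectivity point in the comment above concrete, here is a minimal Python sketch. It assumes the filter keeps dates within 30 days on either side of 2001-05-08, and that `datediff(end, start)` returns `end - start` in days, as in Hive. The one-row-per-day calendar for 2001 is a hypothetical stand-in for `date_dim`, not the real table.

```python
from datetime import date, timedelta

REFERENCE = date(2001, 5, 8)


def datediff(end, start):
    """Days from start to end, mirroring Hive's datediff(enddate, startdate)."""
    return (end - start).days


def in_window(d, days=30):
    """True when d lies within +/- days of the reference date (inclusive)."""
    return -days <= datediff(d, REFERENCE) <= days


# Hypothetical stand-in for date_dim: one row per calendar day of 2001.
calendar = [date(2001, 1, 1) + timedelta(days=i) for i in range(365)]
selected = [d for d in calendar if in_window(d)]

print(len(selected), selected[0], selected[-1])
```

Only the rows in that window survive the filter, so even for a much longer calendar table the join input from the date dimension stays very small.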
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348225#comment-14348225 ]

Yin Huai commented on SPARK-5791:

I see. In Hive's plan, all of item, warehouse, and date_dim are broadcast tables. However, in Spark SQL's plan, the join between item and inventory was a shuffle join. Can you set spark.sql.autoBroadcastJoinThreshold to a value larger than the size of item? Also, what is the value of spark.serializer? Using org.apache.spark.serializer.KryoSerializer for spark.serializer will also help performance (we will use Kryo to serialize broadcast tables).
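The threshold suggestion above can be checked against the table sizes the reporter gives elsewhere in this thread (~4.9 GB inventory, ~73.5 MB item, ~3.1 KB warehouse, ~1.7 MB date_dim). The sketch below is a simplification: it compares on-disk sizes with the threshold, whereas Spark compares its own size estimate for each plan. The 10 MB default is an assumption about this Spark version, and 209715200 is the value the reporter says he tried.

```python
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3

# Approximate on-disk sizes as reported in this thread.
table_sizes = {
    "inventory": int(4.9 * GB),
    "item": int(73.5 * MB),
    "warehouse": int(3.1 * KB),
    "date_dim": int(1.7 * MB),
}


def broadcast_candidates(sizes, threshold_bytes):
    """Return sorted names of tables whose size is at or below the threshold."""
    return sorted(name for name, size in sizes.items() if size <= threshold_bytes)


ASSUMED_DEFAULT_THRESHOLD = 10 * MB  # assumption: 10485760 bytes by default
TUNED_THRESHOLD = 209715200          # value reported in the thread (200 MB)

default_set = broadcast_candidates(table_sizes, ASSUMED_DEFAULT_THRESHOLD)
tuned_set = broadcast_candidates(table_sizes, TUNED_THRESHOLD)

print("default:", default_set)
print("tuned:  ", tuned_set)
```

Under these assumptions, item only becomes a broadcast candidate once the threshold is raised, which is consistent with the shuffle join seen in the first Spark SQL plan.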
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348324#comment-14348324 ]

Yi Zhou commented on SPARK-5791:

Thank you [~yhuai]. I updated the Spark SQL physical plan using the parameters below, and performance improved greatly. However, in the latest test results the query is still slow compared with Hive on M/R (~6min vs ~2min).

spark.sql.shuffle.partitions=200;
spark.sql.autoBroadcastJoinThreshold=209715200;
spark.serializer=org.apache.spark.serializer.KryoSerializer

== Physical Plan ==
InsertIntoHiveTable (MetastoreRelation bigbenchorc, q22_spark_run_query_0_result, None), Map(), false
 Sort [w_warehouse_name#674 ASC,i_item_id#651 ASC], false
  Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
   Filter (((inv_before#635L > 0) && ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) >= 0.)) && ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) <= 1.5))
    Aggregate false, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(PartialSum#716L) AS inv_before#635L,SUM(PartialSum#717L) AS inv_after#636L]
     Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
      Aggregate true, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) < 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#716L,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) >= 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#717L]
       Project [w_warehouse_name#674,i_item_id#651,d_date#688,inv_quantity_on_hand#649]
        BroadcastHashJoin [inv_date_sk#646L], [d_date_sk#686L], BuildRight
         Project [i_item_id#651,w_warehouse_name#674,inv_date_sk#646L,inv_quantity_on_hand#649]
          BroadcastHashJoin [inv_warehouse_sk#648L], [w_warehouse_sk#672L], BuildRight
           Project [inv_warehouse_sk#648L,i_item_id#651,inv_date_sk#646L,inv_quantity_on_hand#649]
            BroadcastHashJoin [inv_item_sk#647L], [i_item_sk#650L], BuildRight
             HiveTableScan [inv_date_sk#646L,inv_item_sk#647L,inv_warehouse_sk#648L,inv_quantity_on_hand#649], (MetastoreRelation bigbenchorc, inventory, Some(inv)), None
             Project [i_item_id#651,i_item_sk#650L]
              Filter ((i_current_price#655 > 0.98) && (i_current_price#655 < 1.5))
               HiveTableScan [i_item_id#651,i_item_sk#650L,i_current_price#655], (MetastoreRelation bigbenchorc, item, None), None
           HiveTableScan [w_warehouse_name#674,w_warehouse_sk#672L], (MetastoreRelation bigbenchorc, warehouse, Some(w)), None
         Filter ((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) >= -30) && (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) <= 30))
          HiveTableScan [d_date_sk#686L,d_date#688], (MetastoreRelation bigbenchorc, date_dim, Some(d)), None
Time taken: 2.579 seconds
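The BroadcastHashJoin operators with BuildRight in the plan above build a hash table from the small (right) side and probe it with rows from the large side, so the large table is not shuffled. The following is a toy Python sketch of that idea only, not Spark's implementation; the rows and the function name are hypothetical, and only the column names come from the thread.

```python
from collections import defaultdict


def broadcast_hash_join(big_rows, small_rows, big_key, small_key):
    """Inner-join two lists of dicts by hashing the small side and probing it."""
    # Build phase: index the small table by its join key.
    build = defaultdict(list)
    for row in small_rows:
        build[row[small_key]].append(row)

    # Probe phase: stream the big table and emit one row per match.
    joined = []
    for row in big_rows:
        for match in build.get(row[big_key], []):
            joined.append({**row, **match})
    return joined


# Hypothetical toy rows using column names from the thread.
inventory = [
    {"inv_warehouse_sk": 1, "inv_quantity_on_hand": 10},
    {"inv_warehouse_sk": 2, "inv_quantity_on_hand": 5},
    {"inv_warehouse_sk": 3, "inv_quantity_on_hand": 7},
]
warehouse = [
    {"w_warehouse_sk": 1, "w_warehouse_name": "A"},
    {"w_warehouse_sk": 2, "w_warehouse_name": "B"},
]

result = broadcast_hash_join(inventory, warehouse, "inv_warehouse_sk", "w_warehouse_sk")
for row in result:
    print(row)
```

In a cluster, the build table would be copied to every executor, which is why this strategy only pays off when the build side is small.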
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348216#comment-14348216 ]

Yi Zhou commented on SPARK-5791:

Hi [~yhuai], I attached the physical plan for Hive. Please take a look.
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344486#comment-14344486 ]

Yin Huai commented on SPARK-5791:

[~jameszhouyi] Can you also add the plan generated by Hive?
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341095#comment-14341095 ]

Yi Zhou commented on SPARK-5791:

Adding table size info:
~4.9 GB 'inventory' table
~73.5 MB 'item' table
~3.1 KB 'warehouse' table
~1.7 MB 'date_dim' table
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321777#comment-14321777 ]

Yi Zhou commented on SPARK-5791:

For the same input dataset size, the query takes about 2 minutes on Hive on M/R with optimization parameters, but about 1 hour on Spark SQL.
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319400#comment-14319400 ]

Cheng Hao commented on SPARK-5791:

Can you also attach the performance comparison result for this query?
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319364#comment-14319364 ]

Yi Zhou commented on SPARK-5791:

For example:

SELECT *
FROM inventory inv
JOIN (
  SELECT i_item_id, i_item_sk
  FROM item
  WHERE i_current_price > 0.98
    AND i_current_price < 1.5
) items ON inv.inv_item_sk = items.i_item_sk
JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk
JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk
WHERE datediff(d_date, '2001-05-08') >= -30
  AND datediff(d_date, '2001-05-08') <= 30;

> [Spark SQL] show poor performance when multiple table do join operation
> ---
>
> Key: SPARK-5791
> URL: https://issues.apache.org/jira/browse/SPARK-5791
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0
> Reporter: Yi Zhou
>
> Spark SQL shows poor performance when multiple tables do join operation compared with Hive on MapReduce.