[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491864#comment-14491864 ] Yi Zhou edited comment on SPARK-5791 at 4/13/15 2:57 AM: - We changed file format from ORC to Parquet and test based the latest spark code(1.4.0-SNAPSHOT). Got the result like below: Spark SQL(2m28s) vs. Hive (3m12s) was (Author: jameszhouyi): We changed file format from ORC to Parquet. Got the result like below: Spark SQL(2m28s) vs. Hive (3m12s) [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physcial_Plan_Hive.txt, Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt Spark SQL shows poor performance when multiple tables do join operation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348293#comment-14348293 ] Cheng Hao edited comment on SPARK-5791 at 3/5/15 7:08 AM: -- I think this is a typical case that we need to optimize the join for the dimension tables, as they have lots of the data are filtered out with the join condition. In this case it's possible most of records in the factor table 'inv' are filtered for the join condition of {panel} JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk WHERE datediff(d_date, '2001-05-08') = -30 AND datediff(d_date, '2001-05-08') = 30 {panel} was (Author: chenghao): I think this is a typical case that we need to optimize the join for the dimension tables, as they have lots of the data are filtered out with the join condition. In this case it's possible most of data are filtered for the join condition of {panel} JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk WHERE datediff(d_date, '2001-05-08') = -30 AND datediff(d_date, '2001-05-08') = 30 {panel} [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt Spark SQL shows poor performance when multiple tables do join operation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348293#comment-14348293 ] Cheng Hao edited comment on SPARK-5791 at 3/5/15 7:07 AM: -- I think this is a typical case that we need to optimize the join for the dimension tables, as they have lots of the data are filtered out with the join condition. In this case it's possible most of data are filtered for the join condition of {panel} JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk WHERE datediff(d_date, '2001-05-08') = -30 AND datediff(d_date, '2001-05-08') = 30 {panel} was (Author: chenghao): I think this is a typical case that we need to optimize the join for the dimension tables, as they have lots of the data are filtered out with the join condition. In this case it's possible most of data are filtered for the join condition of {panel} JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk WHERE datediff(d_date, '2001-05-08') = -30 AND datediff(d_date, '2001-05-08') = 30 {/panel} [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt Spark SQL shows poor performance when multiple tables do join operation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348324#comment-14348324 ] Yi Zhou edited comment on SPARK-5791 at 3/5/15 7:40 AM: Thank you [~yhuai]. Updated SparkSQL physical plan with below parameters with great improved performance. But from latest test results, the query still get slow compared with Hive on M/R (~6min vs ~2min) spark.sql.shuffle.partitions=200; spark.sql.autoBroadcastJoinThreshold=209715200; spark.serializer=org.apache.spark.serializer.KryoSerializer was (Author: jameszhouyi): Thank you [~yhuai]. Updated SparkSQL physical plan with below parameters with great improved performance. But from latest test results, the query still get slow compared with Hive on M/R (~6min vs ~2min) spark.sql.shuffle.partitions=200; spark.sql.autoBroadcastJoinThreshold=209715200; spark.serializer=org.apache.spark.serializer.KryoSerializer == Physical Plan == InsertIntoHiveTable (MetastoreRelation bigbenchorc, q22_spark_run_query_0_result, None), Map(), false Sort [w_warehouse_name#674 ASC,i_item_id#651 ASC], false Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200) Filter (((inv_before#635L 0) ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) = 0.)) ((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) = 1.5)) Aggregate false, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(PartialSum#716L) AS inv_before#635L,SUM(PartialSum#717L) AS inv_after#636L] Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200) Aggregate true, [w_warehouse_name#674,i_item_id#651], [w_warehouse_name#674,i_item_id#651,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#716L,SUM(CAST(CASE WHEN (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) = 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#717L] Project [w_warehouse_name#674,i_item_id#651,d_date#688,inv_quantity_on_hand#649] BroadcastHashJoin [inv_date_sk#646L], [d_date_sk#686L], BuildRight Project [i_item_id#651,w_warehouse_name#674,inv_date_sk#646L,inv_quantity_on_hand#649] BroadcastHashJoin [inv_warehouse_sk#648L], [w_warehouse_sk#672L], BuildRight Project [inv_warehouse_sk#648L,i_item_id#651,inv_date_sk#646L,inv_quantity_on_hand#649] BroadcastHashJoin [inv_item_sk#647L], [i_item_sk#650L], BuildRight HiveTableScan [inv_date_sk#646L,inv_item_sk#647L,inv_warehouse_sk#648L,inv_quantity_on_hand#649], (MetastoreRelation bigbenchorc, inventory, Some(inv)), None Project [i_item_id#651,i_item_sk#650L] Filter ((i_current_price#655 0.98) (i_current_price#655 1.5)) HiveTableScan [i_item_id#651,i_item_sk#650L,i_current_price#655], (MetastoreRelation bigbenchorc, item, None), None HiveTableScan [w_warehouse_name#674,w_warehouse_sk#672L], (MetastoreRelation bigbenchorc, warehouse, Some(w)), None Filter ((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) = -30) (HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08) = 30)) HiveTableScan [d_date_sk#686L,d_date#688], (MetastoreRelation bigbenchorc, date_dim, Some(d)), None Time taken: 2.579 seconds [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt Spark SQL shows poor performance when multiple tables do join operation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319364#comment-14319364 ] Reynold Xin edited comment on SPARK-5791 at 3/3/15 5:05 AM: {code} INSERT INTO TABLE q22_spark_RUN_QUERY_0_result SELECT * FROM ( SELECT w_warehouse_name, i_item_id, SUM( CASE WHEN datediff(d_date, '2001-05-08') 0 THEN inv_quantity_on_hand ELSE 0 END ) AS inv_before, SUM( CASE WHEN datediff(d_date, '2001-05-08') = 0 THEN inv_quantity_on_hand ELSE 0 END ) AS inv_after FROM ( SELECT * FROM inventory inv JOIN ( SELECT i_item_id, i_item_sk FROM item WHERE i_current_price 0.98 AND i_current_price 1.5 ) items ON inv.inv_item_sk = items.i_item_sk JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk WHERE datediff(d_date, '2001-05-08') = -30 AND datediff(d_date, '2001-05-08') = 30 ) q22_coalition_22 GROUP BY w_warehouse_name, i_item_id ) name WHERE inv_before 0 AND inv_after / inv_before = 2.0 / 3.0 AND inv_after / inv_before = 3.0 / 2.0 CLUSTER BY w_warehouse_name, i_item_id; {code} was (Author: jameszhouyi): INSERT INTO TABLE q22_spark_RUN_QUERY_0_result SELECT * FROM ( SELECT w_warehouse_name, i_item_id, SUM( CASE WHEN datediff(d_date, '2001-05-08') 0 THEN inv_quantity_on_hand ELSE 0 END ) AS inv_before, SUM( CASE WHEN datediff(d_date, '2001-05-08') = 0 THEN inv_quantity_on_hand ELSE 0 END ) AS inv_after FROM ( SELECT * FROM inventory inv JOIN ( SELECT i_item_id, i_item_sk FROM item WHERE i_current_price 0.98 AND i_current_price 1.5 ) items ON inv.inv_item_sk = items.i_item_sk JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk WHERE datediff(d_date, '2001-05-08') = -30 AND datediff(d_date, '2001-05-08') = 30 ) q22_coalition_22 GROUP BY w_warehouse_name, i_item_id ) name WHERE inv_before 0 AND inv_after / inv_before = 2.0 / 3.0 AND inv_after / inv_before = 3.0 / 2.0 CLUSTER BY w_warehouse_name, i_item_id; [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physical_Plan.txt Spark SQL shows poor performance when multiple tables do join operation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341095#comment-14341095 ] Yi Zhou edited comment on SPARK-5791 at 2/28/15 1:36 AM: - Add tables size info: ~4.9 GB 'inventory' table(Fact) ~73.5 MB 'item' table(Dimension) ~3.1 KB 'warehouse' table(Dimension) ~1.7MB 'date_dim' table(Dimension) was (Author: jameszhouyi): Add tables size info: ~4.9 GB 'inventory' table ~73.5 MB 'item' table ~3.1 KB 'warehouse' table ~1.7MB 'date_dim' table [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physical_Plan.txt Spark SQL shows poor performance when multiple tables do join operation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319364#comment-14319364 ] Yi Zhou edited comment on SPARK-5791 at 2/28/15 12:39 AM: -- INSERT INTO TABLE q22_spark_RUN_QUERY_0_result SELECT * FROM ( SELECT w_warehouse_name, i_item_id, SUM( CASE WHEN datediff(d_date, '2001-05-08') 0 THEN inv_quantity_on_hand ELSE 0 END ) AS inv_before, SUM( CASE WHEN datediff(d_date, '2001-05-08') = 0 THEN inv_quantity_on_hand ELSE 0 END ) AS inv_after FROM ( SELECT * FROM inventory inv JOIN ( SELECT i_item_id, i_item_sk FROM item WHERE i_current_price 0.98 AND i_current_price 1.5 ) items ON inv.inv_item_sk = items.i_item_sk JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk WHERE datediff(d_date, '2001-05-08') = -30 AND datediff(d_date, '2001-05-08') = 30 ) q22_coalition_22 GROUP BY w_warehouse_name, i_item_id ) name WHERE inv_before 0 AND inv_after / inv_before = 2.0 / 3.0 AND inv_after / inv_before = 3.0 / 2.0 CLUSTER BY w_warehouse_name, i_item_id; was (Author: jameszhouyi): For example: SELECT * FROM inventory inv JOIN ( SELECT i_item_id, i_item_sk FROM item WHERE i_current_price 0.98 AND i_current_price 1.5 ) items ON inv.inv_item_sk = items.i_item_sk JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk WHERE datediff(d_date, '2001-05-08') = -30 AND datediff(d_date, '2001-05-08') = 30; [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physical_Plan.txt Spark SQL shows poor performance when multiple tables do join operation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321777#comment-14321777 ] Yi Zhou edited comment on SPARK-5791 at 2/15/15 1:49 AM: - For the same input data set size(e.g.,1TB), it costs about ~2mins on hive on M/R with optimization parameters but it costs about ~1hour on SparkSQL. was (Author: jameszhouyi): For the same input dataset size, it costs about ~2mins on hive on M/R with optimization parameters but it costs about ~1hour on SparkSQL. [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Spark SQL shows poor performance when multiple tables do join operation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org