[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-04-12 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491864#comment-14491864
 ] 

Yi Zhou edited comment on SPARK-5791 at 4/13/15 2:57 AM:
-

We changed the file format from ORC to Parquet and tested against the latest Spark 
code (1.4.0-SNAPSHOT).  Got the result below:
Spark SQL (2m28s) vs. Hive (3m12s)


was (Author: jameszhouyi):
We changed file format from ORC to Parquet.  Got the result like below:
Spark SQL(2m28s) vs. Hive (3m12s)

 [Spark SQL] show poor performance when multiple table do join operation
 ---

 Key: SPARK-5791
 URL: https://issues.apache.org/jira/browse/SPARK-5791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou
 Attachments: Physcial_Plan_Hive.txt, 
 Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt


 Spark SQL shows poor performance when multiple tables do join operation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-03-04 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348293#comment-14348293
 ] 

Cheng Hao edited comment on SPARK-5791 at 3/5/15 7:08 AM:
--

I think this is a typical case where we need to optimize the join for the 
dimension tables, since most of their data is filtered out by the join 
condition.

In this case it's likely that most records in the fact table 'inv' are 
filtered out by the join condition of 
{panel}
JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk
WHERE datediff(d_date, '2001-05-08') >= -30
AND datediff(d_date, '2001-05-08') <= 30
{panel}
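To get a rough sense of how selective that date predicate is, here is a small pure-Python sketch (the date_dim calendar range below is an assumption; the real table's range is not stated in this thread). Only the ±30-day window around 2001-05-08 survives, which is why broadcasting the filtered dimension instead of shuffling the fact table is attractive:

```python
from datetime import date, timedelta

# Hypothetical date_dim covering 1990-01-01 .. 2010-12-31 (an assumption;
# TPC-DS-style calendars span decades).
start, end = date(1990, 1, 1), date(2010, 12, 31)
total_days = (end - start).days + 1

anchor = date(2001, 5, 8)
# The query keeps rows with datediff(d_date, '2001-05-08') between -30 and 30.
surviving = sum(
    1 for n in range(total_days)
    if -30 <= ((start + timedelta(days=n)) - anchor).days <= 30
)

print(surviving, total_days)  # 61 of 7670 calendar rows survive the filter
```

Under these assumptions, fewer than 1% of date_dim rows survive, so the join key set against inventory is tiny.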


was (Author: chenghao):
I think this is a typical case that we need to optimize the join for the 
dimension tables, as they have lots of the data are filtered out with the join 
condition.

In this case it's possible most of data are filtered for the join condition of 
{panel}
JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk
WHERE datediff(d_date, '2001-05-08') >= -30
AND datediff(d_date, '2001-05-08') <= 30
{panel}

 [Spark SQL] show poor performance when multiple table do join operation
 ---

 Key: SPARK-5791
 URL: https://issues.apache.org/jira/browse/SPARK-5791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou
 Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt


 Spark SQL shows poor performance when multiple tables do join operation









[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-03-04 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348324#comment-14348324
 ] 

Yi Zhou edited comment on SPARK-5791 at 3/5/15 7:40 AM:


Thank you [~yhuai]. Updated the Spark SQL physical plan with the parameters 
below, which greatly improved performance. But per the latest test results, the 
query is still slow compared with Hive on M/R (~6 min vs. ~2 min):
spark.sql.shuffle.partitions=200;
spark.sql.autoBroadcastJoinThreshold=209715200;
spark.serializer=org.apache.spark.serializer.KryoSerializer
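As a quick sanity check on these numbers: 209715200 bytes is exactly 200 MB, well above Spark's default broadcast threshold of 10 MB. The flag-building helper below is illustrative, not from the thread:

```python
# 209715200 bytes = 200 MB; Spark's default for
# spark.sql.autoBroadcastJoinThreshold is 10 MB (10485760), too small to
# broadcast the ~73.5 MB 'item' dimension table mentioned in this thread.
threshold = 209715200
assert threshold == 200 * 1024 * 1024

# One illustrative way to pass the same settings on the command line:
flags = " ".join(
    f"--conf {key}={value}"
    for key, value in [
        ("spark.sql.shuffle.partitions", "200"),
        ("spark.sql.autoBroadcastJoinThreshold", str(threshold)),
        ("spark.serializer", "org.apache.spark.serializer.KryoSerializer"),
    ]
)
print(flags)
```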



was (Author: jameszhouyi):
Thank you [~yhuai]. Updated SparkSQL physical plan with  below parameters with 
great improved performance. But from latest test results, the query still get 
slow compared with Hive on M/R (~6min vs ~2min)
spark.sql.shuffle.partitions=200;
spark.sql.autoBroadcastJoinThreshold=209715200;
spark.serializer=org.apache.spark.serializer.KryoSerializer

== Physical Plan ==
InsertIntoHiveTable (MetastoreRelation bigbenchorc, 
q22_spark_run_query_0_result, None), Map(), false
 Sort [w_warehouse_name#674 ASC,i_item_id#651 ASC], false
  Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
   Filter (((inv_before#635L > 0) && ((CAST(inv_after#636L, DoubleType) / 
CAST(inv_before#635L, DoubleType)) >= 0.6666666666666666)) && 
((CAST(inv_after#636L, DoubleType) / CAST(inv_before#635L, DoubleType)) <= 1.5))
Aggregate false, [w_warehouse_name#674,i_item_id#651], 
[w_warehouse_name#674,i_item_id#651,SUM(PartialSum#716L) AS 
inv_before#635L,SUM(PartialSum#717L) AS inv_after#636L]
 Exchange (HashPartitioning [w_warehouse_name#674,i_item_id#651], 200)
  Aggregate true, [w_warehouse_name#674,i_item_id#651], 
[w_warehouse_name#674,i_item_id#651,SUM(CAST(CASE WHEN 
(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08)
 < 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS 
PartialSum#716L,SUM(CAST(CASE WHEN 
(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08)
 >= 0) THEN inv_quantity_on_hand#649 ELSE 0, LongType)) AS PartialSum#717L]
   Project 
[w_warehouse_name#674,i_item_id#651,d_date#688,inv_quantity_on_hand#649]
BroadcastHashJoin [inv_date_sk#646L], [d_date_sk#686L], BuildRight
 Project 
[i_item_id#651,w_warehouse_name#674,inv_date_sk#646L,inv_quantity_on_hand#649]
  BroadcastHashJoin [inv_warehouse_sk#648L], [w_warehouse_sk#672L], 
BuildRight
   Project 
[inv_warehouse_sk#648L,i_item_id#651,inv_date_sk#646L,inv_quantity_on_hand#649]
BroadcastHashJoin [inv_item_sk#647L], [i_item_sk#650L], BuildRight
 HiveTableScan 
[inv_date_sk#646L,inv_item_sk#647L,inv_warehouse_sk#648L,inv_quantity_on_hand#649],
 (MetastoreRelation bigbenchorc, inventory, Some(inv)), None
 Project [i_item_id#651,i_item_sk#650L]
  Filter ((i_current_price#655 > 0.98) && (i_current_price#655 < 1.5))
   HiveTableScan 
[i_item_id#651,i_item_sk#650L,i_current_price#655], (MetastoreRelation 
bigbenchorc, item, None), None
   HiveTableScan [w_warehouse_name#674,w_warehouse_sk#672L], 
(MetastoreRelation bigbenchorc, warehouse, Some(w)), None
 Filter 
((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08)
 >= -30) && 
(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFDateDiff(d_date#688,2001-05-08)
 <= 30))
  HiveTableScan [d_date_sk#686L,d_date#688], (MetastoreRelation 
bigbenchorc, date_dim, Some(d)), None
Time taken: 2.579 seconds


 [Spark SQL] show poor performance when multiple table do join operation
 ---

 Key: SPARK-5791
 URL: https://issues.apache.org/jira/browse/SPARK-5791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou
 Attachments: Physcial_Plan_Hive.txt, Physical_Plan.txt


 Spark SQL shows poor performance when multiple tables do join operation






[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-03-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319364#comment-14319364
 ] 

Reynold Xin edited comment on SPARK-5791 at 3/3/15 5:05 AM:


{code}
INSERT INTO TABLE q22_spark_RUN_QUERY_0_result
SELECT *
FROM (
  SELECT
w_warehouse_name,
i_item_id,
SUM(
  CASE WHEN datediff(d_date, '2001-05-08') < 0
  THEN inv_quantity_on_hand
  ELSE 0 END
) AS inv_before,
SUM(
  CASE WHEN datediff(d_date, '2001-05-08') >= 0
  THEN inv_quantity_on_hand
  ELSE 0 END
) AS inv_after
  FROM (
SELECT *
FROM inventory inv
JOIN (
  SELECT
i_item_id,
i_item_sk
  FROM item
  WHERE i_current_price > 0.98
  AND i_current_price < 1.5
) items
ON inv.inv_item_sk = items.i_item_sk
JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk
JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk
WHERE datediff(d_date, '2001-05-08') >= -30
AND datediff(d_date, '2001-05-08') <= 30
  ) q22_coalition_22
  GROUP BY w_warehouse_name, i_item_id
) name
WHERE inv_before > 0
AND inv_after / inv_before >= 2.0 / 3.0
AND inv_after / inv_before <= 3.0 / 2.0
CLUSTER BY
  w_warehouse_name,
  i_item_id;
{code}
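The final WHERE clause above keeps only warehouse/item groups whose post-date inventory stays within a factor of 1.5 of the pre-date inventory (the band [2/3, 3/2], as in the standard BigBench q22). A minimal pure-Python check of that band logic, with hypothetical inventory counts:

```python
def within_band(inv_before, inv_after):
    """Replicates the final filter: inv_before > 0 and the
    after/before ratio inside the inclusive band [2/3, 3/2]."""
    if inv_before <= 0:
        return False
    ratio = inv_after / inv_before
    return 2.0 / 3.0 <= ratio <= 3.0 / 2.0

assert within_band(100, 100)       # unchanged inventory passes
assert within_band(100, 150)       # exactly 3/2 is inclusive
assert not within_band(100, 151)   # just above the band fails
assert not within_band(0, 50)      # zero 'before' is excluded
```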



was (Author: jameszhouyi):
INSERT INTO TABLE q22_spark_RUN_QUERY_0_result
SELECT *
FROM (
  SELECT
w_warehouse_name,
i_item_id,
SUM(
  CASE WHEN datediff(d_date, '2001-05-08') < 0
  THEN inv_quantity_on_hand
  ELSE 0 END
) AS inv_before,
SUM(
  CASE WHEN datediff(d_date, '2001-05-08') >= 0
  THEN inv_quantity_on_hand
  ELSE 0 END
) AS inv_after
  FROM (
SELECT *
FROM inventory inv
JOIN (
  SELECT
i_item_id,
i_item_sk
  FROM item
  WHERE i_current_price > 0.98
  AND i_current_price < 1.5
) items
ON inv.inv_item_sk = items.i_item_sk
JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk
JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk
WHERE datediff(d_date, '2001-05-08') >= -30
AND datediff(d_date, '2001-05-08') <= 30
  ) q22_coalition_22
  GROUP BY w_warehouse_name, i_item_id
) name
WHERE inv_before > 0
AND inv_after / inv_before >= 2.0 / 3.0
AND inv_after / inv_before <= 3.0 / 2.0
CLUSTER BY
  w_warehouse_name,
  i_item_id;

 [Spark SQL] show poor performance when multiple table do join operation
 ---

 Key: SPARK-5791
 URL: https://issues.apache.org/jira/browse/SPARK-5791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou
 Attachments: Physical_Plan.txt


 Spark SQL shows poor performance when multiple tables do join operation






[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-02-27 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341095#comment-14341095
 ] 

Yi Zhou edited comment on SPARK-5791 at 2/28/15 1:36 AM:
-

Added table size info:
~4.9 GB 'inventory' table (Fact)
~73.5 MB 'item' table (Dimension)
~3.1 KB 'warehouse' table (Dimension)
~1.7 MB 'date_dim' table (Dimension)
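Given those sizes, all three dimension tables fit comfortably under the 200 MB broadcast threshold used elsewhere in this thread, while Spark's default 10 MB threshold would exclude 'item' and force a shuffle join against the ~4.9 GB fact table. A quick sanity check (sizes approximate, taken from the list above):

```python
MB = 1024 * 1024
sizes = {  # approximate on-disk sizes from the comment above
    "item": 73.5 * MB,
    "warehouse": 3.1 * 1024,   # ~3.1 KB
    "date_dim": 1.7 * MB,
}

default_threshold = 10 * MB    # Spark's default autoBroadcastJoinThreshold
raised_threshold = 200 * MB    # 209715200, as used in this thread

# With the default, 'item' is too big to broadcast; the raised
# threshold lets all three dimensions be broadcast-joined.
broadcast_default = [t for t, s in sizes.items() if s <= default_threshold]
broadcast_raised = [t for t, s in sizes.items() if s <= raised_threshold]

print(sorted(broadcast_default), sorted(broadcast_raised))
```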


was (Author: jameszhouyi):
Add tables size info:
~4.9 GB 'inventory' table 
~73.5 MB 'item' table
~3.1 KB 'warehouse'  table
~1.7MB 'date_dim' table

 [Spark SQL] show poor performance when multiple table do join operation
 ---

 Key: SPARK-5791
 URL: https://issues.apache.org/jira/browse/SPARK-5791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou
 Attachments: Physical_Plan.txt


 Spark SQL shows poor performance when multiple tables do join operation






[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-02-27 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319364#comment-14319364
 ] 

Yi Zhou edited comment on SPARK-5791 at 2/28/15 12:39 AM:
--

INSERT INTO TABLE q22_spark_RUN_QUERY_0_result
SELECT *
FROM (
  SELECT
w_warehouse_name,
i_item_id,
SUM(
  CASE WHEN datediff(d_date, '2001-05-08') < 0
  THEN inv_quantity_on_hand
  ELSE 0 END
) AS inv_before,
SUM(
  CASE WHEN datediff(d_date, '2001-05-08') >= 0
  THEN inv_quantity_on_hand
  ELSE 0 END
) AS inv_after
  FROM (
SELECT *
FROM inventory inv
JOIN (
  SELECT
i_item_id,
i_item_sk
  FROM item
  WHERE i_current_price > 0.98
  AND i_current_price < 1.5
) items
ON inv.inv_item_sk = items.i_item_sk
JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk
JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk
WHERE datediff(d_date, '2001-05-08') >= -30
AND datediff(d_date, '2001-05-08') <= 30
  ) q22_coalition_22
  GROUP BY w_warehouse_name, i_item_id
) name
WHERE inv_before > 0
AND inv_after / inv_before >= 2.0 / 3.0
AND inv_after / inv_before <= 3.0 / 2.0
CLUSTER BY
  w_warehouse_name,
  i_item_id;


was (Author: jameszhouyi):
For example:
SELECT *
FROM inventory inv
JOIN (
  SELECT
i_item_id,
i_item_sk
  FROM item
  WHERE i_current_price > 0.98
  AND i_current_price < 1.5
) items
ON inv.inv_item_sk = items.i_item_sk
JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk
JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk
WHERE datediff(d_date, '2001-05-08') >= -30
AND datediff(d_date, '2001-05-08') <= 30;

 [Spark SQL] show poor performance when multiple table do join operation
 ---

 Key: SPARK-5791
 URL: https://issues.apache.org/jira/browse/SPARK-5791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou
 Attachments: Physical_Plan.txt


 Spark SQL shows poor performance when multiple tables do join operation






[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-02-15 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321777#comment-14321777
 ] 

Yi Zhou edited comment on SPARK-5791 at 2/15/15 1:49 AM:
-

For the same input data set size (e.g., 1 TB), the query takes about 2 minutes 
on Hive on M/R with optimization parameters, but about 1 hour on Spark SQL.


was (Author: jameszhouyi):
For the same input dataset size, it costs about ~2mins on hive on M/R with 
optimization parameters but it costs about ~1hour on SparkSQL.

 [Spark SQL] show poor performance when multiple table do join operation
 ---

 Key: SPARK-5791
 URL: https://issues.apache.org/jira/browse/SPARK-5791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou

 Spark SQL shows poor performance when multiple tables do join operation


