[ 
https://issues.apache.org/jira/browse/SPARK-8081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-8081.
---------------------------------
    Resolution: Invalid

I hope we don't get JIRAs like this, where a production query is simply copied 
and pasted. Problems reported this way are really painful to reproduce.

To verify this, it seems I would need to install Sqoop, MySQL, and Hive. Even 
then, the script itself is not executable as is.

So, does the problem boil down to this?

{quote}
As you see, there is a cartesian product, which gives me "outofmemory" 
exception even on such a small database as this one.
{quote}

I strongly feel this should go to the mailing list as a question before we 
consider it an issue (there is no information about your environment, 
settings, or setup). Please reopen _if anyone can reproduce this_, other than 
the author of this JIRA, who appears to have been inactive for more than one 
and a half years.
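
For what it is worth, queries of this shape can usually be rewritten so that 
the join conditions live in explicit INNER JOIN ... ON clauses instead of a 
chain of CROSS JOINs filtered in WHERE; that form spells out the join graph 
directly rather than relying on the optimizer to reassemble it from the WHERE 
predicates. A minimal sketch of the equivalence on toy tables, using Python's 
sqlite3 and made-up data rather than Spark SQL and the real Foodmart schema:

```python
# Sketch: CROSS JOIN + WHERE vs. explicit INNER JOIN ... ON.
# The tables and rows below are illustrative, not the real Foodmart dump;
# the point is only that the two query forms return identical results.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE product_class (product_class_id INTEGER, product_family TEXT);
CREATE TABLE product       (product_id INTEGER, product_class_id INTEGER);
CREATE TABLE sales         (product_id INTEGER, store_sales REAL);
INSERT INTO product_class VALUES (1, 'Food'), (2, 'Drink');
INSERT INTO product       VALUES (10, 1), (11, 2);
INSERT INTO sales         VALUES (10, 2.5), (10, 1.0), (11, 3.0);
""")

# Join conditions buried in WHERE, as in the reported query.
cross_join = """
SELECT product_class.product_family, SUM(sales.store_sales)
FROM sales
CROSS JOIN product_class
CROSS JOIN product
WHERE sales.product_id = product.product_id
  AND product.product_class_id = product_class.product_class_id
GROUP BY product_class.product_family
ORDER BY product_class.product_family
"""

# Same query with the join graph stated explicitly.
inner_join = """
SELECT product_class.product_family, SUM(sales.store_sales)
FROM sales
JOIN product
  ON sales.product_id = product.product_id
JOIN product_class
  ON product.product_class_id = product_class.product_class_id
GROUP BY product_class.product_family
ORDER BY product_class.product_family
"""

# Both forms produce the same rows.
assert cur.execute(cross_join).fetchall() == cur.execute(inner_join).fetchall()
print(cur.execute(inner_join).fetchall())
```

In Spark SQL the second form fixes each join's condition at the join itself, 
so a CartesianProduct should not appear regardless of the order in which the 
tables are listed.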

> Problems with Optimized Logical Plan Generation
> -----------------------------------------------
>
>                 Key: SPARK-8081
>                 URL: https://issues.apache.org/jira/browse/SPARK-8081
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.2, 1.3.1
>            Reporter: Ihor Bobak
>              Labels: performance
>
> You can reproduce the issue on this database:
> https://drive.google.com/file/d/0B3DMXMfcPWF3UEtsTjJ3aGMzOEE/view
> This is a Foodmart script for mysql, and a script for importing this database 
> to Hive.
> The problem is as follows. When we run the following query on Spark SQL:
> SELECT `time_by_day`.`the_year` `c0`
>       ,`product_class`.`product_family` `c1`
>       ,SUM(`sales_fact_1997`.`unit_sales`) `m0`
>       ,SUM(`sales_fact_1997`.`store_cost`) `m1`
>       ,SUM(`sales_fact_1997`.`store_sales`) `m2`
>       ,COUNT(`sales_fact_1997`.`product_id`) `m3`
>       ,COUNT(DISTINCT `sales_fact_1997`.`customer_id`) `m4`
>       ,SUM((
>                       CASE 
>                               WHEN `sales_fact_1997`.`promotion_id` = 0
>                                       THEN 0
>                               ELSE `sales_fact_1997`.`store_sales`
>                               END
>                       )) `m5`
> FROM `foodmart`.`time_by_day` `time_by_day`
> CROSS JOIN `foodmart`.`sales_fact_1997` `sales_fact_1997`
> CROSS JOIN `foodmart`.`product` `product`
> CROSS JOIN `foodmart`.`product_class` `product_class`
> WHERE 
>                       `sales_fact_1997`.`time_id` = `time_by_day`.`time_id`
>                       AND `sales_fact_1997`.`product_id` = 
> `product`.`product_id`
>                       AND `product`.`product_class_id` = 
> `product_class`.`product_class_id`
> GROUP BY `time_by_day`.`the_year`
>       ,`product_class`.`product_family`
> the plan will be OK:
> == Optimized Logical Plan ==
> Aggregate [the_year#51,product_family#84], [the_year#51 AS 
> c0#0,product_family#84 AS c1#1,SUM(unit_sales#64) AS m0#2,SUM(store_cost#63) 
> AS m1#3,SUM(store_sales#62) AS m2#4,COUNT(product_id#57) AS 
> m3#5L,COUNT(DISTINCT customer_id#59) AS m4#6L,SUM(CASE WHEN (promotion_id#60 
> = 0) THEN 0.0 ELSE store_sales#62) AS m5#7]
>  Project 
> [store_cost#63,the_year#51,store_sales#62,product_family#84,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>   Join Inner, Some((product_class_id#65 = product_class_id#80))
>    Project 
> [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,promotion_id#60]
>     Join Inner, Some((product_id#57 = product_id#66))
>      Project 
> [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>       Join Inner, Some((time_id#58 = time_id#47))
>        Project [time_id#47,the_year#51]
>         MetastoreRelation foodmart, time_by_day, Some(time_by_day)
>        Project 
> [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60]
>         MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)
>      Project [product_id#66,product_class_id#65]
>       MetastoreRelation foodmart, product, Some(product)
>    Project [product_family#84,product_class_id#80]
>     MetastoreRelation foodmart, product_class, Some(product_class)
> == Physical Plan ==
> Aggregate false, [the_year#51,product_family#84], [the_year#51 AS 
> c0#0,product_family#84 AS c1#1,SUM(PartialSum#91) AS m0#2,SUM(PartialSum#92) 
> AS m1#3,SUM(PartialSum#93) AS m2#4,Coalesce(SUM(PartialCount#94L),0) AS 
> m3#5L,CombineAndCount(partialSets#95) AS m4#6L,SUM(PartialSum#96) AS m5#7]
>  Exchange (HashPartitioning [the_year#51,product_family#84], 200)
>   Aggregate true, [the_year#51,product_family#84], 
> [the_year#51,product_family#84,SUM(store_cost#63) AS 
> PartialSum#92,AddToHashSet(customer_id#59) AS 
> partialSets#95,SUM(store_sales#62) AS PartialSum#93,SUM(CASE WHEN 
> (promotion_id#60 = 0) THEN 0.0 ELSE store_sales#62) AS 
> PartialSum#96,SUM(unit_sales#64) AS PartialSum#91,COUNT(product_id#57) AS 
> PartialCount#94L]
>    Project 
> [store_cost#63,the_year#51,store_sales#62,product_family#84,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>     BroadcastHashJoin [product_class_id#65], [product_class_id#80], BuildRight
>      Project 
> [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,promotion_id#60]
>       BroadcastHashJoin [product_id#57], [product_id#66], BuildRight
>        Project 
> [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>         BroadcastHashJoin [time_id#47], [time_id#58], BuildRight
>          HiveTableScan [time_id#47,the_year#51], (MetastoreRelation foodmart, 
> time_by_day, Some(time_by_day)), None
>          HiveTableScan 
> [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60],
>  (MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)), None
>        HiveTableScan [product_id#66,product_class_id#65], (MetastoreRelation 
> foodmart, product, Some(product)), None
>      HiveTableScan [product_family#84,product_class_id#80], 
> (MetastoreRelation foodmart, product_class, Some(product_class)), None
> But as soon as we run the same query with the order of the tables product 
> and product_class swapped:
> SELECT `time_by_day`.`the_year` `c0`
>       ,`product_class`.`product_family` `c1`
>       ,SUM(`sales_fact_1997`.`unit_sales`) `m0`
>       ,SUM(`sales_fact_1997`.`store_cost`) `m1`
>       ,SUM(`sales_fact_1997`.`store_sales`) `m2`
>       ,COUNT(`sales_fact_1997`.`product_id`) `m3`
>       ,COUNT(DISTINCT `sales_fact_1997`.`customer_id`) `m4`
>       ,SUM((
>                       CASE 
>                               WHEN `sales_fact_1997`.`promotion_id` = 0
>                                       THEN 0
>                               ELSE `sales_fact_1997`.`store_sales`
>                               END
>                       )) `m5`
> FROM `foodmart`.`time_by_day` `time_by_day`
> CROSS JOIN `foodmart`.`sales_fact_1997` `sales_fact_1997`
> CROSS JOIN `foodmart`.`product_class` `product_class`
> CROSS JOIN `foodmart`.`product` `product`
> WHERE 
>                       `sales_fact_1997`.`time_id` = `time_by_day`.`time_id`
>                       AND `sales_fact_1997`.`product_id` = 
> `product`.`product_id`
>                       AND `product`.`product_class_id` = 
> `product_class`.`product_class_id`
> GROUP BY `time_by_day`.`the_year`
>       ,`product_class`.`product_family`
> we will get this:
> == Optimized Logical Plan ==
> Aggregate [the_year#51,product_family#69], [the_year#51 AS 
> c0#0,product_family#69 AS c1#1,SUM(unit_sales#64) AS m0#2,SUM(store_cost#63) 
> AS m1#3,SUM(store_sales#62) AS m2#4,COUNT(product_id#57) AS 
> m3#5L,COUNT(DISTINCT customer_id#59) AS m4#6L,SUM(CASE WHEN (promotion_id#60 
> = 0) THEN 0.0 ELSE store_sales#62) AS m5#7]
>  Project 
> [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,product_family#69,promotion_id#60]
>   Join Inner, Some(((product_id#57 = product_id#71) && (product_class_id#70 = 
> product_class_id#65)))
>    Project 
> [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,product_family#69,promotion_id#60]
>     Join Inner, None
>      Project 
> [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>       Join Inner, Some((time_id#58 = time_id#47))
>        Project [time_id#47,the_year#51]
>         MetastoreRelation foodmart, time_by_day, Some(time_by_day)
>        Project 
> [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60]
>         MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)
>      Project [product_family#69,product_class_id#65]
>       MetastoreRelation foodmart, product_class, Some(product_class)
>    Project [product_id#71,product_class_id#70]
>     MetastoreRelation foodmart, product, Some(product)
> == Physical Plan ==
> Aggregate false, [the_year#51,product_family#69], [the_year#51 AS 
> c0#0,product_family#69 AS c1#1,SUM(PartialSum#91) AS m0#2,SUM(PartialSum#92) 
> AS m1#3,SUM(PartialSum#93) AS m2#4,Coalesce(SUM(PartialCount#94L),0) AS 
> m3#5L,CombineAndCount(partialSets#95) AS m4#6L,SUM(PartialSum#96) AS m5#7]
>  Exchange (HashPartitioning [the_year#51,product_family#69], 200)
>   Aggregate true, [the_year#51,product_family#69], 
> [the_year#51,product_family#69,SUM(store_cost#63) AS 
> PartialSum#92,AddToHashSet(customer_id#59) AS 
> partialSets#95,SUM(store_sales#62) AS PartialSum#93,SUM(CASE WHEN 
> (promotion_id#60 = 0) THEN 0.0 ELSE store_sales#62) AS 
> PartialSum#96,SUM(unit_sales#64) AS PartialSum#91,COUNT(product_id#57) AS 
> PartialCount#94L]
>    Project 
> [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,product_family#69,promotion_id#60]
>     BroadcastHashJoin [product_id#57,product_class_id#65], 
> [product_id#71,product_class_id#70], BuildRight
>      Project 
> [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,product_family#69,promotion_id#60]
>       CartesianProduct
>        Project 
> [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>         BroadcastHashJoin [time_id#47], [time_id#58], BuildRight
>          HiveTableScan [time_id#47,the_year#51], (MetastoreRelation foodmart, 
> time_by_day, Some(time_by_day)), None
>          HiveTableScan 
> [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60],
>  (MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)), None
>        HiveTableScan [product_family#69,product_class_id#65], 
> (MetastoreRelation foodmart, product_class, Some(product_class)), None
>      HiveTableScan [product_id#71,product_class_id#70], (MetastoreRelation 
> foodmart, product, Some(product)), None
> As you see, there is a cartesian product, which gives me "outofmemory" 
> exception even on such a small database as this one.
> The query was generated by Mondrian; it is not hand-written. But I tested 
> the same query on MySQL, and it works just fine. On Hive2 this query also 
> works fine, regardless of the order of the tables in the "from" clause.
> Could you please fix this in future Spark versions?
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
