Ihor Bobak created SPARK-8081:
---------------------------------

             Summary: Problems with Optimized Logical Plan Generation
                 Key: SPARK-8081
                 URL: https://issues.apache.org/jira/browse/SPARK-8081
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.1, 1.2.2
            Reporter: Ihor Bobak


You can reproduce the issue with this database:
https://drive.google.com/file/d/0B3DMXMfcPWF3UEtsTjJ3aGMzOEE/view
It contains a Foodmart script for MySQL and a script for importing the database
into Hive.

The problem is as follows. When we run the following query on Spark SQL:

SELECT `time_by_day`.`the_year` `c0`
        ,`product_class`.`product_family` `c1`
        ,SUM(`sales_fact_1997`.`unit_sales`) `m0`
        ,SUM(`sales_fact_1997`.`store_cost`) `m1`
        ,SUM(`sales_fact_1997`.`store_sales`) `m2`
        ,COUNT(`sales_fact_1997`.`product_id`) `m3`
        ,COUNT(DISTINCT `sales_fact_1997`.`customer_id`) `m4`
        ,SUM((
                        CASE 
                                WHEN `sales_fact_1997`.`promotion_id` = 0
                                        THEN 0
                                ELSE `sales_fact_1997`.`store_sales`
                                END
                        )) `m5`
FROM `foodmart`.`time_by_day` `time_by_day`
CROSS JOIN `foodmart`.`sales_fact_1997` `sales_fact_1997`
CROSS JOIN `foodmart`.`product` `product`
CROSS JOIN `foodmart`.`product_class` `product_class`
WHERE
                        `sales_fact_1997`.`time_id` = `time_by_day`.`time_id`
                        AND `sales_fact_1997`.`product_id` = `product`.`product_id`
                        AND `product`.`product_class_id` = `product_class`.`product_class_id`

GROUP BY `time_by_day`.`the_year`
        ,`product_class`.`product_family`



the plan is fine:

== Optimized Logical Plan ==
Aggregate [the_year#51,product_family#84], [the_year#51 AS c0#0,product_family#84 AS c1#1,SUM(unit_sales#64) AS m0#2,SUM(store_cost#63) AS m1#3,SUM(store_sales#62) AS m2#4,COUNT(product_id#57) AS m3#5L,COUNT(DISTINCT customer_id#59) AS m4#6L,SUM(CASE WHEN (promotion_id#60 = 0) THEN 0.0 ELSE store_sales#62) AS m5#7]
 Project [store_cost#63,the_year#51,store_sales#62,product_family#84,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
  Join Inner, Some((product_class_id#65 = product_class_id#80))
   Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,promotion_id#60]
    Join Inner, Some((product_id#57 = product_id#66))
     Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
      Join Inner, Some((time_id#58 = time_id#47))
       Project [time_id#47,the_year#51]
        MetastoreRelation foodmart, time_by_day, Some(time_by_day)
       Project [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60]
        MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)
     Project [product_id#66,product_class_id#65]
      MetastoreRelation foodmart, product, Some(product)
   Project [product_family#84,product_class_id#80]
    MetastoreRelation foodmart, product_class, Some(product_class)

== Physical Plan ==
Aggregate false, [the_year#51,product_family#84], [the_year#51 AS c0#0,product_family#84 AS c1#1,SUM(PartialSum#91) AS m0#2,SUM(PartialSum#92) AS m1#3,SUM(PartialSum#93) AS m2#4,Coalesce(SUM(PartialCount#94L),0) AS m3#5L,CombineAndCount(partialSets#95) AS m4#6L,SUM(PartialSum#96) AS m5#7]
 Exchange (HashPartitioning [the_year#51,product_family#84], 200)
  Aggregate true, [the_year#51,product_family#84], [the_year#51,product_family#84,SUM(store_cost#63) AS PartialSum#92,AddToHashSet(customer_id#59) AS partialSets#95,SUM(store_sales#62) AS PartialSum#93,SUM(CASE WHEN (promotion_id#60 = 0) THEN 0.0 ELSE store_sales#62) AS PartialSum#96,SUM(unit_sales#64) AS PartialSum#91,COUNT(product_id#57) AS PartialCount#94L]
   Project [store_cost#63,the_year#51,store_sales#62,product_family#84,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
    BroadcastHashJoin [product_class_id#65], [product_class_id#80], BuildRight
     Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,promotion_id#60]
      BroadcastHashJoin [product_id#57], [product_id#66], BuildRight
       Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
        BroadcastHashJoin [time_id#47], [time_id#58], BuildRight
         HiveTableScan [time_id#47,the_year#51], (MetastoreRelation foodmart, time_by_day, Some(time_by_day)), None
         HiveTableScan [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60], (MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)), None
       HiveTableScan [product_id#66,product_class_id#65], (MetastoreRelation foodmart, product, Some(product)), None
     HiveTableScan [product_family#84,product_class_id#80], (MetastoreRelation foodmart, product_class, Some(product_class)), None





But as soon as we run the same query with the order of the tables product and
product_class swapped:


SELECT `time_by_day`.`the_year` `c0`
        ,`product_class`.`product_family` `c1`
        ,SUM(`sales_fact_1997`.`unit_sales`) `m0`
        ,SUM(`sales_fact_1997`.`store_cost`) `m1`
        ,SUM(`sales_fact_1997`.`store_sales`) `m2`
        ,COUNT(`sales_fact_1997`.`product_id`) `m3`
        ,COUNT(DISTINCT `sales_fact_1997`.`customer_id`) `m4`
        ,SUM((
                        CASE 
                                WHEN `sales_fact_1997`.`promotion_id` = 0
                                        THEN 0
                                ELSE `sales_fact_1997`.`store_sales`
                                END
                        )) `m5`
FROM `foodmart`.`time_by_day` `time_by_day`
CROSS JOIN `foodmart`.`sales_fact_1997` `sales_fact_1997`
CROSS JOIN `foodmart`.`product_class` `product_class`
CROSS JOIN `foodmart`.`product` `product`
WHERE
                        `sales_fact_1997`.`time_id` = `time_by_day`.`time_id`
                        AND `sales_fact_1997`.`product_id` = `product`.`product_id`
                        AND `product`.`product_class_id` = `product_class`.`product_class_id`

GROUP BY `time_by_day`.`the_year`
        ,`product_class`.`product_family`


we get this:


== Optimized Logical Plan ==
Aggregate [the_year#51,product_family#69], [the_year#51 AS c0#0,product_family#69 AS c1#1,SUM(unit_sales#64) AS m0#2,SUM(store_cost#63) AS m1#3,SUM(store_sales#62) AS m2#4,COUNT(product_id#57) AS m3#5L,COUNT(DISTINCT customer_id#59) AS m4#6L,SUM(CASE WHEN (promotion_id#60 = 0) THEN 0.0 ELSE store_sales#62) AS m5#7]
 Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,product_family#69,promotion_id#60]
  Join Inner, Some(((product_id#57 = product_id#71) && (product_class_id#70 = product_class_id#65)))
   Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,product_family#69,promotion_id#60]
    Join Inner, None
     Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
      Join Inner, Some((time_id#58 = time_id#47))
       Project [time_id#47,the_year#51]
        MetastoreRelation foodmart, time_by_day, Some(time_by_day)
       Project [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60]
        MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)
     Project [product_family#69,product_class_id#65]
      MetastoreRelation foodmart, product_class, Some(product_class)
   Project [product_id#71,product_class_id#70]
    MetastoreRelation foodmart, product, Some(product)

== Physical Plan ==
Aggregate false, [the_year#51,product_family#69], [the_year#51 AS c0#0,product_family#69 AS c1#1,SUM(PartialSum#91) AS m0#2,SUM(PartialSum#92) AS m1#3,SUM(PartialSum#93) AS m2#4,Coalesce(SUM(PartialCount#94L),0) AS m3#5L,CombineAndCount(partialSets#95) AS m4#6L,SUM(PartialSum#96) AS m5#7]
 Exchange (HashPartitioning [the_year#51,product_family#69], 200)
  Aggregate true, [the_year#51,product_family#69], [the_year#51,product_family#69,SUM(store_cost#63) AS PartialSum#92,AddToHashSet(customer_id#59) AS partialSets#95,SUM(store_sales#62) AS PartialSum#93,SUM(CASE WHEN (promotion_id#60 = 0) THEN 0.0 ELSE store_sales#62) AS PartialSum#96,SUM(unit_sales#64) AS PartialSum#91,COUNT(product_id#57) AS PartialCount#94L]
   Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,product_family#69,promotion_id#60]
    BroadcastHashJoin [product_id#57,product_class_id#65], [product_id#71,product_class_id#70], BuildRight
     Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,product_family#69,promotion_id#60]
      CartesianProduct
       Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
        BroadcastHashJoin [time_id#47], [time_id#58], BuildRight
         HiveTableScan [time_id#47,the_year#51], (MetastoreRelation foodmart, time_by_day, Some(time_by_day)), None
         HiveTableScan [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60], (MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)), None
       HiveTableScan [product_family#69,product_class_id#65], (MetastoreRelation foodmart, product_class, Some(product_class)), None
     HiveTableScan [product_id#71,product_class_id#70], (MetastoreRelation foodmart, product, Some(product)), None



As you can see, the plan contains a CartesianProduct, which causes an
OutOfMemoryError even on a database as small as this one.


The query was generated by Mondrian; it is not hand-written. I tested the same
query on MySQL, where it works just fine. On Hive2 it also works fine,
regardless of the order of the tables in the FROM clause.
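
A possible workaround until the optimizer handles this case (a sketch only; I
have not verified it against 1.2.2/1.3.1): post-process the generated SQL into
explicit JOIN ... ON syntax, so that each join condition is attached directly
to its join instead of having to be pulled out of the WHERE clause by the
optimizer:

```sql
SELECT `time_by_day`.`the_year` `c0`
        ,`product_class`.`product_family` `c1`
        ,SUM(`sales_fact_1997`.`unit_sales`) `m0`
        ,SUM(`sales_fact_1997`.`store_cost`) `m1`
        ,SUM(`sales_fact_1997`.`store_sales`) `m2`
        ,COUNT(`sales_fact_1997`.`product_id`) `m3`
        ,COUNT(DISTINCT `sales_fact_1997`.`customer_id`) `m4`
        ,SUM((
                        CASE
                                WHEN `sales_fact_1997`.`promotion_id` = 0
                                        THEN 0
                                ELSE `sales_fact_1997`.`store_sales`
                                END
                        )) `m5`
FROM `foodmart`.`sales_fact_1997` `sales_fact_1997`
JOIN `foodmart`.`time_by_day` `time_by_day`
        ON `sales_fact_1997`.`time_id` = `time_by_day`.`time_id`
JOIN `foodmart`.`product` `product`
        ON `sales_fact_1997`.`product_id` = `product`.`product_id`
JOIN `foodmart`.`product_class` `product_class`
        ON `product`.`product_class_id` = `product_class`.`product_class_id`
GROUP BY `time_by_day`.`the_year`
        ,`product_class`.`product_family`
```

With the conditions attached to the joins, the table order in the FROM clause
should no longer affect the planner; whether Mondrian can be configured to emit
this form is a separate question.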

Could you please fix this in a future Spark version?

Thanks.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
