[ https://issues.apache.org/jira/browse/SPARK-8081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-8081.
---------------------------------
    Resolution: Invalid

I hope we do not get JIRAs like this one, which simply copy and paste a production query. Reproducing such a problem is really painful: to verify this, it seems I would need to install Sqoop, MySQL and Hive, and even then the script itself is not executable as is. So, the problem is

{quote}
As you see, there is a cartesian product, which gives me "outofmemory" exception even on such a small database as this one.
{quote}

? I strongly feel this should go to the mailing list as a question before we consider it an issue (there is no info about your environment, settings and setup). Please reopen _if anyone can reproduce this_, other than the author of this JIRA, who seems to have been inactive for more than one and a half years.

> Problems with Optimized Logical Plan Generation
> -----------------------------------------------
>
>                 Key: SPARK-8081
>                 URL: https://issues.apache.org/jira/browse/SPARK-8081
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.2, 1.3.1
>            Reporter: Ihor Bobak
>              Labels: performance
>
> You can reproduce the issue on this database:
> https://drive.google.com/file/d/0B3DMXMfcPWF3UEtsTjJ3aGMzOEE/view
> This is a Foodmart script for MySQL, and a script for importing this database to Hive.
> The problem is as follows. When we run the following query on Spark SQL:
>
> SELECT `time_by_day`.`the_year` `c0`
>     ,`product_class`.`product_family` `c1`
>     ,SUM(`sales_fact_1997`.`unit_sales`) `m0`
>     ,SUM(`sales_fact_1997`.`store_cost`) `m1`
>     ,SUM(`sales_fact_1997`.`store_sales`) `m2`
>     ,COUNT(`sales_fact_1997`.`product_id`) `m3`
>     ,COUNT(DISTINCT `sales_fact_1997`.`customer_id`) `m4`
>     ,SUM((
>         CASE
>             WHEN `sales_fact_1997`.`promotion_id` = 0
>                 THEN 0
>             ELSE `sales_fact_1997`.`store_sales`
>             END
>         )) `m5`
> FROM `foodmart`.`time_by_day` `time_by_day`
> CROSS JOIN `foodmart`.`sales_fact_1997` `sales_fact_1997`
> CROSS JOIN `foodmart`.`product` `product`
> CROSS JOIN `foodmart`.`product_class` `product_class`
> WHERE `sales_fact_1997`.`time_id` = `time_by_day`.`time_id`
>     AND `sales_fact_1997`.`product_id` = `product`.`product_id`
>     AND `product`.`product_class_id` = `product_class`.`product_class_id`
> GROUP BY `time_by_day`.`the_year`
>     ,`product_class`.`product_family`
>
> the plan will be OK:
>
> == Optimized Logical Plan ==
> Aggregate [the_year#51,product_family#84], [the_year#51 AS c0#0,product_family#84 AS c1#1,SUM(unit_sales#64) AS m0#2,SUM(store_cost#63) AS m1#3,SUM(store_sales#62) AS m2#4,COUNT(product_id#57) AS m3#5L,COUNT(DISTINCT customer_id#59) AS m4#6L,SUM(CASE WHEN (promotion_id#60 = 0) THEN 0.0 ELSE store_sales#62) AS m5#7]
>  Project [store_cost#63,the_year#51,store_sales#62,product_family#84,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>   Join Inner, Some((product_class_id#65 = product_class_id#80))
>    Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,promotion_id#60]
>     Join Inner, Some((product_id#57 = product_id#66))
>      Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>       Join Inner, Some((time_id#58 = time_id#47))
>        Project [time_id#47,the_year#51]
>         MetastoreRelation foodmart, time_by_day, Some(time_by_day)
>        Project [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60]
>         MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)
>      Project [product_id#66,product_class_id#65]
>       MetastoreRelation foodmart, product, Some(product)
>    Project [product_family#84,product_class_id#80]
>     MetastoreRelation foodmart, product_class, Some(product_class)
>
> == Physical Plan ==
> Aggregate false, [the_year#51,product_family#84], [the_year#51 AS c0#0,product_family#84 AS c1#1,SUM(PartialSum#91) AS m0#2,SUM(PartialSum#92) AS m1#3,SUM(PartialSum#93) AS m2#4,Coalesce(SUM(PartialCount#94L),0) AS m3#5L,CombineAndCount(partialSets#95) AS m4#6L,SUM(PartialSum#96) AS m5#7]
>  Exchange (HashPartitioning [the_year#51,product_family#84], 200)
>   Aggregate true, [the_year#51,product_family#84], [the_year#51,product_family#84,SUM(store_cost#63) AS PartialSum#92,AddToHashSet(customer_id#59) AS partialSets#95,SUM(store_sales#62) AS PartialSum#93,SUM(CASE WHEN (promotion_id#60 = 0) THEN 0.0 ELSE store_sales#62) AS PartialSum#96,SUM(unit_sales#64) AS PartialSum#91,COUNT(product_id#57) AS PartialCount#94L]
>    Project [store_cost#63,the_year#51,store_sales#62,product_family#84,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>     BroadcastHashJoin [product_class_id#65], [product_class_id#80], BuildRight
>      Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,promotion_id#60]
>       BroadcastHashJoin [product_id#57], [product_id#66], BuildRight
>        Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>         BroadcastHashJoin [time_id#47], [time_id#58], BuildRight
>          HiveTableScan [time_id#47,the_year#51], (MetastoreRelation foodmart, time_by_day, Some(time_by_day)), None
>          HiveTableScan [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60], (MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)), None
>        HiveTableScan [product_id#66,product_class_id#65], (MetastoreRelation foodmart, product, Some(product)), None
>      HiveTableScan [product_family#84,product_class_id#80], (MetastoreRelation foodmart, product_class, Some(product_class)), None
>
> But as soon as we run the same query with the order of the tables product and product_class swapped:
>
> SELECT `time_by_day`.`the_year` `c0`
>     ,`product_class`.`product_family` `c1`
>     ,SUM(`sales_fact_1997`.`unit_sales`) `m0`
>     ,SUM(`sales_fact_1997`.`store_cost`) `m1`
>     ,SUM(`sales_fact_1997`.`store_sales`) `m2`
>     ,COUNT(`sales_fact_1997`.`product_id`) `m3`
>     ,COUNT(DISTINCT `sales_fact_1997`.`customer_id`) `m4`
>     ,SUM((
>         CASE
>             WHEN `sales_fact_1997`.`promotion_id` = 0
>                 THEN 0
>             ELSE `sales_fact_1997`.`store_sales`
>             END
>         )) `m5`
> FROM `foodmart`.`time_by_day` `time_by_day`
> CROSS JOIN `foodmart`.`sales_fact_1997` `sales_fact_1997`
> CROSS JOIN `foodmart`.`product_class` `product_class`
> CROSS JOIN `foodmart`.`product` `product`
> WHERE `sales_fact_1997`.`time_id` = `time_by_day`.`time_id`
>     AND `sales_fact_1997`.`product_id` = `product`.`product_id`
>     AND `product`.`product_class_id` = `product_class`.`product_class_id`
> GROUP BY `time_by_day`.`the_year`
>     ,`product_class`.`product_family`
>
> we will get this:
>
> == Optimized Logical Plan ==
> Aggregate [the_year#51,product_family#69], [the_year#51 AS c0#0,product_family#69 AS c1#1,SUM(unit_sales#64) AS m0#2,SUM(store_cost#63) AS m1#3,SUM(store_sales#62) AS m2#4,COUNT(product_id#57) AS m3#5L,COUNT(DISTINCT customer_id#59) AS m4#6L,SUM(CASE WHEN (promotion_id#60 = 0) THEN 0.0 ELSE store_sales#62) AS m5#7]
>  Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,product_family#69,promotion_id#60]
>   Join Inner, Some(((product_id#57 = product_id#71) && (product_class_id#70 = product_class_id#65)))
>    Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,product_family#69,promotion_id#60]
>     Join Inner, None
>      Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>       Join Inner, Some((time_id#58 = time_id#47))
>        Project [time_id#47,the_year#51]
>         MetastoreRelation foodmart, time_by_day, Some(time_by_day)
>        Project [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60]
>         MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)
>      Project [product_family#69,product_class_id#65]
>       MetastoreRelation foodmart, product_class, Some(product_class)
>    Project [product_id#71,product_class_id#70]
>     MetastoreRelation foodmart, product, Some(product)
>
> == Physical Plan ==
> Aggregate false, [the_year#51,product_family#69], [the_year#51 AS c0#0,product_family#69 AS c1#1,SUM(PartialSum#91) AS m0#2,SUM(PartialSum#92) AS m1#3,SUM(PartialSum#93) AS m2#4,Coalesce(SUM(PartialCount#94L),0) AS m3#5L,CombineAndCount(partialSets#95) AS m4#6L,SUM(PartialSum#96) AS m5#7]
>  Exchange (HashPartitioning [the_year#51,product_family#69], 200)
>   Aggregate true, [the_year#51,product_family#69], [the_year#51,product_family#69,SUM(store_cost#63) AS PartialSum#92,AddToHashSet(customer_id#59) AS partialSets#95,SUM(store_sales#62) AS PartialSum#93,SUM(CASE WHEN (promotion_id#60 = 0) THEN 0.0 ELSE store_sales#62) AS PartialSum#96,SUM(unit_sales#64) AS PartialSum#91,COUNT(product_id#57) AS PartialCount#94L]
>    Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,product_family#69,promotion_id#60]
>     BroadcastHashJoin [product_id#57,product_class_id#65], [product_id#71,product_class_id#70], BuildRight
>      Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_class_id#65,product_id#57,product_family#69,promotion_id#60]
>       CartesianProduct
>        Project [store_cost#63,the_year#51,store_sales#62,unit_sales#64,customer_id#59,product_id#57,promotion_id#60]
>         BroadcastHashJoin [time_id#47], [time_id#58], BuildRight
>          HiveTableScan [time_id#47,the_year#51], (MetastoreRelation foodmart, time_by_day, Some(time_by_day)), None
>          HiveTableScan [store_cost#63,store_sales#62,unit_sales#64,customer_id#59,product_id#57,time_id#58,promotion_id#60], (MetastoreRelation foodmart, sales_fact_1997, Some(sales_fact_1997)), None
>        HiveTableScan [product_family#69,product_class_id#65], (MetastoreRelation foodmart, product_class, Some(product_class)), None
>      HiveTableScan [product_id#71,product_class_id#70], (MetastoreRelation foodmart, product, Some(product)), None
>
> As you see, there is a cartesian product, which gives me an "outofmemory" exception even on such a small database as this one.
> The query was generated by Mondrian; it is not human-written. But I tested the same query on MySQL, and it works just fine. On Hive2 this query also works fine, regardless of the order of the tables in the FROM clause.
> Could you please fix this in future Spark versions?
> Thanks.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
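A possible workaround sketch, not from the original report: the second plan goes wrong because the tables are written as CROSS JOINs and the optimizer (in 1.2.x/1.3.x) fails to pull the matching equality predicates out of the WHERE clause when product_class appears before product. Writing the same query with explicit ANSI INNER JOIN ... ON syntax attaches each join condition directly to its join, so the planner no longer depends on table order. This rewrite targets the Foodmart schema as quoted above and has not been verified against Spark 1.3:

```sql
-- Hypothetical rewrite (not from the original report): state each join
-- condition in an ON clause instead of relying on CROSS JOIN + WHERE.
SELECT `time_by_day`.`the_year` `c0`
    ,`product_class`.`product_family` `c1`
    ,SUM(`sales_fact_1997`.`unit_sales`) `m0`
    ,SUM(`sales_fact_1997`.`store_cost`) `m1`
    ,SUM(`sales_fact_1997`.`store_sales`) `m2`
    ,COUNT(`sales_fact_1997`.`product_id`) `m3`
    ,COUNT(DISTINCT `sales_fact_1997`.`customer_id`) `m4`
    ,SUM(CASE WHEN `sales_fact_1997`.`promotion_id` = 0
              THEN 0 ELSE `sales_fact_1997`.`store_sales` END) `m5`
FROM `foodmart`.`sales_fact_1997` `sales_fact_1997`
INNER JOIN `foodmart`.`time_by_day` `time_by_day`
    ON `sales_fact_1997`.`time_id` = `time_by_day`.`time_id`
INNER JOIN `foodmart`.`product` `product`
    ON `sales_fact_1997`.`product_id` = `product`.`product_id`
INNER JOIN `foodmart`.`product_class` `product_class`
    ON `product`.`product_class_id` = `product_class`.`product_class_id`
GROUP BY `time_by_day`.`the_year`
    ,`product_class`.`product_family`
```

Since the report says the query comes from Mondrian rather than being hand-written, this only helps if the generator can be configured to emit explicit joins; otherwise, running EXPLAIN on the generated query and checking for a CartesianProduct operator before executing it is a cheap guard.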