[jira] [Comment Edited] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Nattavut Sutyanyong (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15737142#comment-15737142 ]

Nattavut Sutyanyong edited comment on SPARK-18814 at 12/10/16 3:50 AM:
---

It looks like the {{Project}} between {{Aggregate}} and {{Filter scalar-subquery}} maps {{cs_item_sk#39}} to {{cs_item_sk#39#111}}. The logic in the code is not robust enough to recognize that the two symbols are equivalent. I tried to simplify the problem to:

{code}
// spark-shell repro: p is the outer (parent) table, c the table in the correlated subquery
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1, BigDecimal(1.0))).toDF("k", "v").createOrReplaceTempView("P")
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1, BigDecimal(1.0))).toDF("k1", "v1").createOrReplaceTempView("C")

sql("select * from p where v = (select 1.1 * avg(v1) from c where c.k1 = p.k)").explain(true)
{code}

This should have all the elements required to reproduce the problem, but somehow I could not get the required {{Project}} operator, so there is no mapping of the column {{p.k}} as there is in TPCDS-Q32.
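
One way to check where that {{Project}} would show up (a sketch against Spark 2.1's catalyst internals; {{ScalarSubquery}} and the tree helpers are internal APIs and may differ in other versions):

{code}
// Sketch only: assumes spark-shell on Spark 2.1, after running the repro above.
import org.apache.spark.sql.catalyst.expressions.ScalarSubquery
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Project}

val analyzed = sql("select * from p where v = (select 1.1 * avg(v1) from c where c.k1 = p.k)").queryExecution.analyzed

// Subquery plans hang off expressions, so a plain plan.collect does not
// descend into them; pull them out explicitly first.
val subqueryPlans = analyzed.flatMap(_.expressions.flatMap(_.collect {
  case s: ScalarSubquery => s.plan
}))

// TPCDS-Q32 has a Project(..., cs_item_sk#39 AS cs_item_sk#39#111) directly
// over the Aggregate here; the simplified repro apparently does not.
val projectsOverAggregate = subqueryPlans.flatMap(_.collect {
  case p @ Project(_, _: Aggregate) => p
})
projectsOverAggregate.foreach(p => println(p.projectList))
{code}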

I will keep trying.
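
For the record, the kind of alias bookkeeping I think the check is missing is roughly this (a sketch, not the actual {{CheckAnalysis}} code; {{aliasMap}} is a made-up helper name):

{code}
// Sketch only: map each Project output that merely renames a bare column
// (e.g. cs_item_sk#39 AS cs_item_sk#39#111) back to the column it aliases,
// so the two symbols can be treated as equivalent.
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeMap}
import org.apache.spark.sql.catalyst.plans.logical.Project

def aliasMap(p: Project): AttributeMap[Attribute] =
  AttributeMap(p.projectList.collect {
    case a @ Alias(child: Attribute, _) => (a.toAttribute, child)
  })
{code}

With that, a grouping column such as {{cs_item_sk#39}} could be recognized as correlated when its aliased form {{cs_item_sk#39#111}} appears in the pulled-up conditions.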


was (Author: nsyca):
It looks like the {{Project}} between {{Aggregate}} and {{Filter scalar-subquery}} maps {{cs_item_sk#39}} to {{cs_item_sk#39#111}}. The logic in the code is not robust enough to recognize that the two symbols are equivalent. I tried to simplify the problem to:

{code}
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1, BigDecimal(1.0))).toDF("k", "v").createOrReplaceTempView("P")
Seq[(java.lang.Integer, scala.math.BigDecimal)]((1, BigDecimal(1.0))).toDF("k1", "v1").createOrReplaceTempView("C")

sql("select * from p where v = (select 1.1 * avg(v1) from c where c.k1 = p.k)").explain(true)
{code}

This should have all the elements required to reproduce the problem, but somehow I could not get the required `Project` operator, so there is no mapping of the column p.k as there is in TPCDS-Q32.

I will keep trying.

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
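> For reference, Q32 looks roughly like this (a sketch reconstructed from the
> plan below with its literals, not the verbatim benchmark text):
> {code}
> select sum(cs_ext_discount_amt) as `excess discount amount`
> from catalog_sales, item, date_dim
> where i_manufact_id = 977
>   and i_item_sk = cs_item_sk
>   and d_date between '2000-01-27' and (cast('2000-01-27' as date) + interval 90 days)
>   and d_date_sk = cs_sold_date_sk
>   and cs_ext_discount_amt > (
>     select 1.3 * avg(cs_ext_discount_amt)
>     from catalog_sales, date_dim
>     where cs_item_sk = i_item_sk
>       and d_date between '2000-01-27' and (cast('2000-01-27' as date) + interval 90 days)
>       and d_date_sk = cs_sold_date_sk)
> limit 100
> {code}
> The subquery has no explicit GROUP BY; the {{Aggregate [cs_item_sk#39]}} in
> the plan is introduced when Spark pulls the correlated predicate
> {{cs_item_sk = i_item_sk}} out of the subquery.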
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter ((((i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string)))) && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)))))
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6)))), 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : 
