[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly

2022-04-27 Thread Lukas Grasmann (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528863#comment-17528863
 ] 

Lukas Grasmann commented on SPARK-39022:


This is my first contribution. How do I assign this issue to myself?

> Spark SQL - Combination of HAVING and SORT not resolved correctly
> -
>
> Key: SPARK-39022
> URL: https://issues.apache.org/jira/browse/SPARK-39022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.4.0
>Reporter: Lukas Grasmann
>Priority: Major
> Attachments: explain_new.txt, explain_old.txt
>
>
> Example: Given a simple relation {{test}} with two relevant columns {{hotel}} 
> and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} 
> is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s 
> by their cumulative prices but only for hotels where the cumulative price is 
> higher than {{{}150{}}}.
> h2. Current Behavior
> To achieve the goal specified above, we give a simple query that works in 
> most common database systems. Note that we only retrieve {{hotel}} in the 
> {{SELECT ... FROM}} statement which means that the aggregate has to be 
> removed from the result attributes using a {{Project}} node.
> {code:scala}
> sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 
> ORDER BY sum(price)").show{code}
> Currently, this yields an {{AnalysisException}} since the aggregate 
> {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of 
> {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as 
> its output. This prevents the aggregate values from being passed to 
> {{{}Sort{}}}.
> {code:scala}
> org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did 
> you mean one of the following? [test.hotel]; line 1 pos 75;
> 'Sort ['sum('price) ASC NULLS FIRST], true
> +- Project [hotel#17]
>+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double))
>   +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS 
> sum(cast(price#18 as double))#22]
>  +- SubqueryAlias test
> +- View (`test`, [hotel#17,price#18])
>+- Relation [hotel#17,price#18] csv
> {code}
> The {{AnalysisException}} itself, however, is not caused by the introduced 
> {{Project}} as can be seen in the following example. Here, {{sum(price)}} is 
> part of the result and therefore *not* removed using a {{Project}} node.
> {code:scala}
> sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING 
> sum(price) > 150 ORDER BY sum(price)").show{code}
> Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate 
> introduced by the {{Aggregate}} node) is still not successful even if there 
> is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} 
> which is similar to the exception from before. It follows that there is a 
> second error in the analyzer that still prevents successful resolution even 
> if the problem regarding the {{Project}} node is fixed.
> {code:scala}
> org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did 
> you mean one of the following? [sum(price), test.hotel]; line 1 pos 87;
> 'Sort ['sum('price) ASC NULLS FIRST], true
> +- Filter (sum(price)#24 > cast(150 as double))
>+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS 
> sum(price)#24]
>   +- SubqueryAlias test
>  +- View (`test`, [hotel#17,price#18])
> +- Relation [hotel#17,price#18] csv
> {code}
>  
> This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the 
> latest version from the GitHub {{master}} branch.
> h2. Current Workaround
> The issue can currently be worked around by using a subquery to first 
> retrieve only the hotels which fulfill the condition and then ordering them 
> in the outer query:
> {code:sql}
> SELECT hotel, sum_price FROM
> (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING 
> sum(price) > 150) sub
> ORDER BY sum_price;
> {code}
> h2. Proposed Solution(s)
> The first change fixes the (premature) insertion of {{Project}} before a 
> {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} 
> is then parent of the {{Sort}} instead of vice versa. This does not change 
> the results of the computations since both {{Sort}} and {{Project}} do not 
> add or remove tuples from the result.
> There are two potential side-effects to this solution:
>  * May change some plans generated by DataFrame/DataSet which previously also 
> produced similar errors such that they now yield a result instead. However, 
> this is unlikely to produce unexpected/undesired results (see above).
>  * Moving the 

[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly

2022-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528856#comment-17528856
 ] 

Apache Spark commented on SPARK-39022:
--

User 'Lukas-Grasmann' has created a pull request for this issue:
https://github.com/apache/spark/pull/36378

> Spark SQL - Combination of HAVING and SORT not resolved correctly
> -
>
> Key: SPARK-39022
> URL: https://issues.apache.org/jira/browse/SPARK-39022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.4.0
>Reporter: Lukas Grasmann
>Priority: Major
> Attachments: explain_new.txt, explain_old.txt
>
>
> Example: Given a simple relation {{test}} with two relevant columns {{hotel}} 
> and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} 
> is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s 
> by their cumulative prices but only for hotels where the cumulative price is 
> higher than {{{}150{}}}.
> h2. Current Behavior
> To achieve the goal specified above, we give a simple query that works in 
> most common database systems. Note that we only retrieve {{hotel}} in the 
> {{SELECT ... FROM}} statement which means that the aggregate has to be 
> removed from the result attributes using a {{Project}} node.
> {code:scala}
> sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 
> ORDER BY sum(price)").show{code}
> Currently, this yields an {{AnalysisException}} since the aggregate 
> {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of 
> {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as 
> its output. This prevents the aggregate values from being passed to 
> {{{}Sort{}}}.
> {code:scala}
> org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did 
> you mean one of the following? [test.hotel]; line 1 pos 75;
> 'Sort ['sum('price) ASC NULLS FIRST], true
> +- Project [hotel#17]
>+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double))
>   +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS 
> sum(cast(price#18 as double))#22]
>  +- SubqueryAlias test
> +- View (`test`, [hotel#17,price#18])
>+- Relation [hotel#17,price#18] csv
> {code}
> The {{AnalysisException}} itself, however, is not caused by the introduced 
> {{Project}} as can be seen in the following example. Here, {{sum(price)}} is 
> part of the result and therefore *not* removed using a {{Project}} node.
> {code:scala}
> sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING 
> sum(price) > 150 ORDER BY sum(price)").show{code}
> Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate 
> introduced by the {{Aggregate}} node) is still not successful even if there 
> is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} 
> which is similar to the exception from before. It follows that there is a 
> second error in the analyzer that still prevents successful resolution even 
> if the problem regarding the {{Project}} node is fixed.
> {code:scala}
> org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did 
> you mean one of the following? [sum(price), test.hotel]; line 1 pos 87;
> 'Sort ['sum('price) ASC NULLS FIRST], true
> +- Filter (sum(price)#24 > cast(150 as double))
>+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS 
> sum(price)#24]
>   +- SubqueryAlias test
>  +- View (`test`, [hotel#17,price#18])
> +- Relation [hotel#17,price#18] csv
> {code}
>  
> This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the 
> latest version from the GitHub {{master}} branch.
> h2. Current Workaround
> The issue can currently be worked around by using a subquery to first 
> retrieve only the hotels which fulfill the condition and then ordering them 
> in the outer query:
> {code:sql}
> SELECT hotel, sum_price FROM
> (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING 
> sum(price) > 150) sub
> ORDER BY sum_price;
> {code}
> h2. Proposed Solution(s)
> The first change fixes the (premature) insertion of {{Project}} before a 
> {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} 
> is then parent of the {{Sort}} instead of vice versa. This does not change 
> the results of the computations since both {{Sort}} and {{Project}} do not 
> add or remove tuples from the result.
> There are two potential side-effects to this solution:
>  * May change some plans generated by DataFrame/DataSet which previously also 
> produced similar errors such that they now yield a result instead. However, 
> this is unlikely to produce unexpected/undesired 

[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly

2022-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528855#comment-17528855
 ] 

Apache Spark commented on SPARK-39022:
--

User 'Lukas-Grasmann' has created a pull request for this issue:
https://github.com/apache/spark/pull/36378

> Spark SQL - Combination of HAVING and SORT not resolved correctly
> -
>
> Key: SPARK-39022
> URL: https://issues.apache.org/jira/browse/SPARK-39022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.4.0
>Reporter: Lukas Grasmann
>Priority: Major
> Attachments: explain_new.txt, explain_old.txt
>
>
> Example: Given a simple relation {{test}} with two relevant columns {{hotel}} 
> and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} 
> is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s 
> by their cumulative prices but only for hotels where the cumulative price is 
> higher than {{{}150{}}}.
> h2. Current Behavior
> To achieve the goal specified above, we give a simple query that works in 
> most common database systems. Note that we only retrieve {{hotel}} in the 
> {{SELECT ... FROM}} statement which means that the aggregate has to be 
> removed from the result attributes using a {{Project}} node.
> {code:scala}
> sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 
> ORDER BY sum(price)").show{code}
> Currently, this yields an {{AnalysisException}} since the aggregate 
> {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of 
> {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as 
> its output. This prevents the aggregate values from being passed to 
> {{{}Sort{}}}.
> {code:scala}
> org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did 
> you mean one of the following? [test.hotel]; line 1 pos 75;
> 'Sort ['sum('price) ASC NULLS FIRST], true
> +- Project [hotel#17]
>+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double))
>   +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS 
> sum(cast(price#18 as double))#22]
>  +- SubqueryAlias test
> +- View (`test`, [hotel#17,price#18])
>+- Relation [hotel#17,price#18] csv
> {code}
> The {{AnalysisException}} itself, however, is not caused by the introduced 
> {{Project}} as can be seen in the following example. Here, {{sum(price)}} is 
> part of the result and therefore *not* removed using a {{Project}} node.
> {code:scala}
> sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING 
> sum(price) > 150 ORDER BY sum(price)").show{code}
> Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate 
> introduced by the {{Aggregate}} node) is still not successful even if there 
> is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} 
> which is similar to the exception from before. It follows that there is a 
> second error in the analyzer that still prevents successful resolution even 
> if the problem regarding the {{Project}} node is fixed.
> {code:scala}
> org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did 
> you mean one of the following? [sum(price), test.hotel]; line 1 pos 87;
> 'Sort ['sum('price) ASC NULLS FIRST], true
> +- Filter (sum(price)#24 > cast(150 as double))
>+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS 
> sum(price)#24]
>   +- SubqueryAlias test
>  +- View (`test`, [hotel#17,price#18])
> +- Relation [hotel#17,price#18] csv
> {code}
>  
> This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the 
> latest version from the GitHub {{master}} branch.
> h2. Current Workaround
> The issue can currently be worked around by using a subquery to first 
> retrieve only the hotels which fulfill the condition and then ordering them 
> in the outer query:
> {code:sql}
> SELECT hotel, sum_price FROM
> (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING 
> sum(price) > 150) sub
> ORDER BY sum_price;
> {code}
> h2. Proposed Solution(s)
> The first change fixes the (premature) insertion of {{Project}} before a 
> {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} 
> is then parent of the {{Sort}} instead of vice versa. This does not change 
> the results of the computations since both {{Sort}} and {{Project}} do not 
> add or remove tuples from the result.
> There are two potential side-effects to this solution:
>  * May change some plans generated by DataFrame/DataSet which previously also 
> produced similar errors such that they now yield a result instead. However, 
> this is unlikely to produce unexpected/undesired 

[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly

2022-04-26 Thread Lukas Grasmann (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528071#comment-17528071
 ] 

Lukas Grasmann commented on SPARK-39022:


I am working on a Pull Request for this issue in case the solution described is 
ok.

> Spark SQL - Combination of HAVING and SORT not resolved correctly
> -
>
> Key: SPARK-39022
> URL: https://issues.apache.org/jira/browse/SPARK-39022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.4
>Reporter: Lukas Grasmann
>Priority: Major
> Attachments: explain_new.txt, explain_old.txt
>
>
> Example: Given a simple relation {{test}} with two relevant columns {{hotel}} 
> and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} 
> is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s 
> by their cumulative prices but only for hotels where the cumulative price is 
> higher than {{{}150{}}}.
> h2. Current Behavior
> To achieve the goal specified above, we give a simple query that works in 
> most common database systems. Note that we only retrieve {{hotel}} in the 
> {{SELECT ... FROM}} statement which means that the aggregate has to be 
> removed from the result attributes using a {{Project}} node.
> {code:scala}
> sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 
> ORDER BY sum(price)").show{code}
> Currently, this yields an {{AnalysisException}} since the aggregate 
> {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of 
> {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as 
> its output. This prevents the aggregate values from being passed to 
> {{{}Sort{}}}.
> {code:scala}
> org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did 
> you mean one of the following? [test.hotel]; line 1 pos 75;
> 'Sort ['sum('price) ASC NULLS FIRST], true
> +- Project [hotel#17]
>+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double))
>   +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS 
> sum(cast(price#18 as double))#22]
>  +- SubqueryAlias test
> +- View (`test`, [hotel#17,price#18])
>+- Relation [hotel#17,price#18] csv
> {code}
> The {{AnalysisException}} itself, however, is not caused by the introduced 
> {{Project}} as can be seen in the following example. Here, {{sum(price)}} is 
> part of the result and therefore *not* removed using a {{Project}} node.
> {code:scala}
> sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING 
> sum(price) > 150 ORDER BY sum(price)").show{code}
> Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate 
> introduced by the {{Aggregate}} node) is still not successful even if there 
> is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} 
> which is similar to the exception from before. It follows that there is a 
> second error in the analyzer that still prevents successful resolution even 
> if the problem regarding the {{Project}} node is fixed.
> {code:scala}
> org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did 
> you mean one of the following? [sum(price), test.hotel]; line 1 pos 87;
> 'Sort ['sum('price) ASC NULLS FIRST], true
> +- Filter (sum(price)#24 > cast(150 as double))
>+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS 
> sum(price)#24]
>   +- SubqueryAlias test
>  +- View (`test`, [hotel#17,price#18])
> +- Relation [hotel#17,price#18] csv
> {code}
>  
> This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the 
> latest version from the GitHub {{master}} branch.
> h2. Current Workaround
> The issue can currently be worked around by using a subquery to first 
> retrieve only the hotels which fulfill the condition and then ordering them 
> in the outer query:
> {code:sql}
> SELECT hotel, sum_price FROM
> (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING 
> sum(price) > 150) sub
> ORDER BY sum_price;
> {code}
> h2. Proposed Solution(s)
> The first change fixes the (premature) insertion of {{Project}} before a 
> {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} 
> is then parent of the {{Sort}} instead of vice versa. This does not change 
> the results of the computations since both {{Sort}} and {{Project}} do not 
> add or remove tuples from the result.
> There are two potential side-effects to this solution:
>  * May change some plans generated by DataFrame/DataSet which previously also 
> produced similar errors such that they now yield a result instead. However, 
> this is unlikely to produce unexpected/undesired results (see above).
>  *