[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528863#comment-17528863 ] Lukas Grasmann commented on SPARK-39022: This is my first contribution. How do I assign this issue to myself? > Spark SQL - Combination of HAVING and SORT not resolved correctly > - > > Key: SPARK-39022 > URL: https://issues.apache.org/jira/browse/SPARK-39022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.4.0 >Reporter: Lukas Grasmann >Priority: Major > Attachments: explain_new.txt, explain_old.txt > > > Example: Given a simple relation {{test}} with two relevant columns {{hotel}} > and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} > is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s > by their cumulative prices but only for hotels where the cumulative price is > higher than {{{}150{}}}. > h2. Current Behavior > To achieve the goal specified above, we give a simple query that works in > most common database systems. Note that we only retrieve {{hotel}} in the > {{SELECT ... FROM}} statement which means that the aggregate has to be > removed from the result attributes using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 > ORDER BY sum(price)").show{code} > Currently, this yields an {{AnalysisException}} since the aggregate > {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of > {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as > its output. This prevents the aggregate values from being passed to > {{{}Sort{}}}. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [test.hotel]; line 1 pos 75; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Project [hotel#17] >+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double)) > +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(cast(price#18 as double))#22] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) >+- Relation [hotel#17,price#18] csv > {code} > The {{AnalysisException}} itself, however, is not caused by the introduced > {{Project}} as can be seen in the following example. Here, {{sum(price)}} is > part of the result and therefore *not* removed using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING > sum(price) > 150 ORDER BY sum(price)").show{code} > Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate > introduced by the {{Aggregate}} node) is still not successful even if there > is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} > which is similar to the exception from before. It follows that there is a > second error in the analyzer that still prevents successful resolution even > if the problem regarding the {{Project}} node is fixed. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [sum(price), test.hotel]; line 1 pos 87; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Filter (sum(price)#24 > cast(150 as double)) >+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(price)#24] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) > +- Relation [hotel#17,price#18] csv > {code} > > This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the > latest version from the GitHub {{master}} branch. > h2. Current Workaround > The issue can currently be worked around by using a subquery to first > retrieve only the hotels which fulfill the condition and then ordering them > in the outer query: > {code:sql} > SELECT hotel, sum_price FROM > (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING > sum(price) > 150) sub > ORDER BY sum_price; > {code} > h2. Proposed Solution(s) > The first change fixes the (premature) insertion of {{Project}} before a > {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} > is then parent of the {{Sort}} instead of vice versa. This does not change > the results of the computations since both {{Sort}} and {{Project}} do not > add or remove tuples from the result. > There are two potential side-effects to this solution: > * May change some plans generated by DataFrame/DataSet which previously also > produced similar errors such that they now yield a result instead. However, > this is unlikely to produce unexpected/undesired results (see above). > * Moving the
[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528856#comment-17528856 ] Apache Spark commented on SPARK-39022: -- User 'Lukas-Grasmann' has created a pull request for this issue: https://github.com/apache/spark/pull/36378 > Spark SQL - Combination of HAVING and SORT not resolved correctly > - > > Key: SPARK-39022 > URL: https://issues.apache.org/jira/browse/SPARK-39022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.4.0 >Reporter: Lukas Grasmann >Priority: Major > Attachments: explain_new.txt, explain_old.txt > > > Example: Given a simple relation {{test}} with two relevant columns {{hotel}} > and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} > is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s > by their cumulative prices but only for hotels where the cumulative price is > higher than {{{}150{}}}. > h2. Current Behavior > To achieve the goal specified above, we give a simple query that works in > most common database systems. Note that we only retrieve {{hotel}} in the > {{SELECT ... FROM}} statement which means that the aggregate has to be > removed from the result attributes using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 > ORDER BY sum(price)").show{code} > Currently, this yields an {{AnalysisException}} since the aggregate > {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of > {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as > its output. This prevents the aggregate values from being passed to > {{{}Sort{}}}. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [test.hotel]; line 1 pos 75; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Project [hotel#17] >+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double)) > +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(cast(price#18 as double))#22] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) >+- Relation [hotel#17,price#18] csv > {code} > The {{AnalysisException}} itself, however, is not caused by the introduced > {{Project}} as can be seen in the following example. Here, {{sum(price)}} is > part of the result and therefore *not* removed using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING > sum(price) > 150 ORDER BY sum(price)").show{code} > Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate > introduced by the {{Aggregate}} node) is still not successful even if there > is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} > which is similar to the exception from before. It follows that there is a > second error in the analyzer that still prevents successful resolution even > if the problem regarding the {{Project}} node is fixed. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [sum(price), test.hotel]; line 1 pos 87; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Filter (sum(price)#24 > cast(150 as double)) >+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(price)#24] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) > +- Relation [hotel#17,price#18] csv > {code} > > This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the > latest version from the GitHub {{master}} branch. > h2. Current Workaround > The issue can currently be worked around by using a subquery to first > retrieve only the hotels which fulfill the condition and then ordering them > in the outer query: > {code:sql} > SELECT hotel, sum_price FROM > (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING > sum(price) > 150) sub > ORDER BY sum_price; > {code} > h2. Proposed Solution(s) > The first change fixes the (premature) insertion of {{Project}} before a > {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} > is then parent of the {{Sort}} instead of vice versa. This does not change > the results of the computations since both {{Sort}} and {{Project}} do not > add or remove tuples from the result. > There are two potential side-effects to this solution: > * May change some plans generated by DataFrame/DataSet which previously also > produced similar errors such that they now yield a result instead. However, > this is unlikely to produce unexpected/undesired
[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528855#comment-17528855 ] Apache Spark commented on SPARK-39022: -- User 'Lukas-Grasmann' has created a pull request for this issue: https://github.com/apache/spark/pull/36378 > Spark SQL - Combination of HAVING and SORT not resolved correctly > - > > Key: SPARK-39022 > URL: https://issues.apache.org/jira/browse/SPARK-39022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.4.0 >Reporter: Lukas Grasmann >Priority: Major > Attachments: explain_new.txt, explain_old.txt > > > Example: Given a simple relation {{test}} with two relevant columns {{hotel}} > and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} > is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s > by their cumulative prices but only for hotels where the cumulative price is > higher than {{{}150{}}}. > h2. Current Behavior > To achieve the goal specified above, we give a simple query that works in > most common database systems. Note that we only retrieve {{hotel}} in the > {{SELECT ... FROM}} statement which means that the aggregate has to be > removed from the result attributes using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 > ORDER BY sum(price)").show{code} > Currently, this yields an {{AnalysisException}} since the aggregate > {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of > {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as > its output. This prevents the aggregate values from being passed to > {{{}Sort{}}}. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [test.hotel]; line 1 pos 75; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Project [hotel#17] >+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double)) > +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(cast(price#18 as double))#22] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) >+- Relation [hotel#17,price#18] csv > {code} > The {{AnalysisException}} itself, however, is not caused by the introduced > {{Project}} as can be seen in the following example. Here, {{sum(price)}} is > part of the result and therefore *not* removed using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING > sum(price) > 150 ORDER BY sum(price)").show{code} > Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate > introduced by the {{Aggregate}} node) is still not successful even if there > is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} > which is similar to the exception from before. It follows that there is a > second error in the analyzer that still prevents successful resolution even > if the problem regarding the {{Project}} node is fixed. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [sum(price), test.hotel]; line 1 pos 87; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Filter (sum(price)#24 > cast(150 as double)) >+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(price)#24] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) > +- Relation [hotel#17,price#18] csv > {code} > > This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the > latest version from the GitHub {{master}} branch. > h2. Current Workaround > The issue can currently be worked around by using a subquery to first > retrieve only the hotels which fulfill the condition and then ordering them > in the outer query: > {code:sql} > SELECT hotel, sum_price FROM > (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING > sum(price) > 150) sub > ORDER BY sum_price; > {code} > h2. Proposed Solution(s) > The first change fixes the (premature) insertion of {{Project}} before a > {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} > is then parent of the {{Sort}} instead of vice versa. This does not change > the results of the computations since both {{Sort}} and {{Project}} do not > add or remove tuples from the result. > There are two potential side-effects to this solution: > * May change some plans generated by DataFrame/DataSet which previously also > produced similar errors such that they now yield a result instead. However, > this is unlikely to produce unexpected/undesired
[jira] [Commented] (SPARK-39022) Spark SQL - Combination of HAVING and SORT not resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-39022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528071#comment-17528071 ] Lukas Grasmann commented on SPARK-39022: I am working on a Pull Request for this issue in case the solution described is ok. > Spark SQL - Combination of HAVING and SORT not resolved correctly > - > > Key: SPARK-39022 > URL: https://issues.apache.org/jira/browse/SPARK-39022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.4 >Reporter: Lukas Grasmann >Priority: Major > Attachments: explain_new.txt, explain_old.txt > > > Example: Given a simple relation {{test}} with two relevant columns {{hotel}} > and {{price}} where {{hotel}} is a unique identifier of a hotel and {{price}} > is the cost of a night's stay. We would then like to order the {{{}hotel{}}}s > by their cumulative prices but only for hotels where the cumulative price is > higher than {{{}150{}}}. > h2. Current Behavior > To achieve the goal specified above, we give a simple query that works in > most common database systems. Note that we only retrieve {{hotel}} in the > {{SELECT ... FROM}} statement which means that the aggregate has to be > removed from the result attributes using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel FROM test GROUP BY hotel HAVING sum(price) > 150 > ORDER BY sum(price)").show{code} > Currently, this yields an {{AnalysisException}} since the aggregate > {{sum(price)}} in {{Sort}} is not resolved correctly. Note that the child of > {{Sort}} is a (premature) {{Project}} node which only provides {{hotel}} as > its output. This prevents the aggregate values from being passed to > {{{}Sort{}}}. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [test.hotel]; line 1 pos 75; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Project [hotel#17] >+- Filter (sum(cast(price#18 as double))#22 > cast(150 as double)) > +- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(cast(price#18 as double))#22] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) >+- Relation [hotel#17,price#18] csv > {code} > The {{AnalysisException}} itself, however, is not caused by the introduced > {{Project}} as can be seen in the following example. Here, {{sum(price)}} is > part of the result and therefore *not* removed using a {{Project}} node. > {code:scala} > sqlcontext.sql("SELECT hotel, sum(price) FROM test GROUP BY hotel HAVING > sum(price) > 150 ORDER BY sum(price)").show{code} > Resolving the aggregate {{sum(price)}} (i.e., resolving it to the aggregate > introduced by the {{Aggregate}} node) is still not successful even if there > is no {{{}Project{}}}. Spark still throws the following {{AnalysisException}} > which is similar to the exception from before. It follows that there is a > second error in the analyzer that still prevents successful resolution even > if the problem regarding the {{Project}} node is fixed. > {code:scala} > org.apache.spark.sql.AnalysisException: Column 'price' does not exist. Did > you mean one of the following? [sum(price), test.hotel]; line 1 pos 87; > 'Sort ['sum('price) ASC NULLS FIRST], true > +- Filter (sum(price)#24 > cast(150 as double)) >+- Aggregate [HOTEL#17], [hotel#17, sum(cast(price#18 as double)) AS > sum(price)#24] > +- SubqueryAlias test > +- View (`test`, [hotel#17,price#18]) > +- Relation [hotel#17,price#18] csv > {code} > > This error occurs (at least) in Spark versions 3.1.2, 3.2.1, as well as the > latest version from the GitHub {{master}} branch. > h2. Current Workaround > The issue can currently be worked around by using a subquery to first > retrieve only the hotels which fulfill the condition and then ordering them > in the outer query: > {code:sql} > SELECT hotel, sum_price FROM > (SELECT hotel, sum(price) AS sum_price FROM test GROUP BY hotel HAVING > sum(price) > 150) sub > ORDER BY sum_price; > {code} > h2. Proposed Solution(s) > The first change fixes the (premature) insertion of {{Project}} before a > {{Sort}} by moving the {{Project}} up in the plan such that the {{Project}} > is then parent of the {{Sort}} instead of vice versa. This does not change > the results of the computations since both {{Sort}} and {{Project}} do not > add or remove tuples from the result. > There are two potential side-effects to this solution: > * May change some plans generated by DataFrame/DataSet which previously also > produced similar errors such that they now yield a result instead. However, > this is unlikely to produce unexpected/undesired results (see above). > *