[jira] [Commented] (SPARK-39302) Spark SQL - wrong field selection in group by
[ https://issues.apache.org/jira/browse/SPARK-39302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554015#comment-17554015 ] Sean R. Owen commented on SPARK-39302: -- The error seems correct. You group the result of the join by country_id, but in the join that definitely resolves to country_id 910, as there is no other country_id. The projection happens after, which creates the second country_id. Group the result of the join after the projection. > Spark SQL - wrong field selection in group by > - > > Key: SPARK-39302 > URL: https://issues.apache.org/jira/browse/SPARK-39302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: shaharl...@gmail.com >Priority: Minor > > SQL parser select wrong field to group by. > See following example. > Sample data: > > {code:java} > spark.sql(""" > select "US" as pv_countryCode > union all > select "IL" as pv_countryCode > """).createOrReplaceTempView("my_test_data") > spark.sql(""" > select -3 as id, "US" as countryCode > """).createOrReplaceTempView("country_codes_sample") > {code} > code: > {code:java} > > spark.sql(""" > WITH BASE > (SELECT > pv_countryCode AS country_id > FROM my_test_data) > SELECT > COALESCE(cct.id, -3) as country_id, > count(1) as count > FROM BASE > LEFT JOIN country_codes_sample cct ON cct.countryCode = > country_id > GROUP BY > country_id > """).createOrReplaceTempView("campaign_country_bug") {code} > Error: > {code:java} > org.apache.spark.sql.AnalysisException: expression 'cct.`id`' is neither > present in the group by, nor is it an aggregate function. Add to group by or > wrap in first() (or first_value) if you don't care which value you get.; > Aggregate [country_id#910], [coalesce(id#886, -3) AS country_id#908, count(1) > AS count#909L] > +- Join LeftOuter, (countryCode#887 = country_id#910) >:- SubqueryAlias BASE >: +- Project [pv_countryCode#883 AS country_id#910] >: +- SubqueryAlias my_test_data >:+- Union false, false >: :- Project [US AS pv_countryCode#883] >: : +- OneRowRelation >: +- Project [IL AS pv_countryCode#884] >: +- OneRowRelation >+- SubqueryAlias cct > +- SubqueryAlias country_codes_sample > +- Project [-3 AS id#886, US AS countryCode#887] > +- OneRowRelation {code} > I expected Spark to choose the selected country_id(country_id#908) instead of > country_id#910 > Or at least throw ambiguous exception when grouping by `country_id`. > This lead developers to add `cct.id` into group by which results in > unexpected results. > (In case country_id has both null and -3 values) > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39302) Spark SQL - wrong field selection in group by
[ https://issues.apache.org/jira/browse/SPARK-39302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553939#comment-17553939 ] shaharl...@gmail.com commented on SPARK-39302: -- [~srowen] , You are absolutely right! Not sure what I was going through at that moment... :D > Spark SQL - wrong field selection in group by > - > > Key: SPARK-39302 > URL: https://issues.apache.org/jira/browse/SPARK-39302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: shaharl...@gmail.com >Priority: Minor > > SQL parser select wrong field to group by. > See following example. > Sample data: > > {code:java} > spark.sql(""" > select "US" as pv_countryCode > union all > select "IL" as pv_countryCode > """).createOrReplaceTempView("my_test_data") > spark.sql(""" > select -3 as id, "US" as countryCode > """).createOrReplaceTempView("country_codes_sample") > {code} > code: > {code:java} > > spark.sql(""" > WITH BASE > (SELECT > pv_countryCode AS country_id > FROM my_test_data) > SELECT > COALESCE(cct.id, -3) as country_id, > count(1) as count > FROM BASE > LEFT JOIN country_codes_sample cct ON cct.countryCode = > country_id > GROUP BY > country_id > """).createOrReplaceTempView("campaign_country_bug") {code} > Error: > {code:java} > org.apache.spark.sql.AnalysisException: expression 'cct.`id`' is neither > present in the group by, nor is it an aggregate function. Add to group by or > wrap in first() (or first_value) if you don't care which value you get.; > Aggregate [country_id#910], [coalesce(id#886, -3) AS country_id#908, count(1) > AS count#909L] > +- Join LeftOuter, (countryCode#887 = country_id#910) >:- SubqueryAlias BASE >: +- Project [pv_countryCode#883 AS country_id#910] >: +- SubqueryAlias my_test_data >:+- Union false, false >: :- Project [US AS pv_countryCode#883] >: : +- OneRowRelation >: +- Project [IL AS pv_countryCode#884] >: +- OneRowRelation >+- SubqueryAlias cct > +- SubqueryAlias country_codes_sample > +- Project [-3 AS id#886, US AS countryCode#887] > +- OneRowRelation {code} > I expected Spark to choose the selected country_id(country_id#908) instead of > country_id#910 > Or at least throw ambiguous exception when grouping by `country_id`. > This lead developers to add `cct.id` into group by which results in > unexpected results. > (In case country_id has both null and -3 values) > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39302) Spark SQL - wrong field selection in group by
[ https://issues.apache.org/jira/browse/SPARK-39302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553825#comment-17553825 ] Sean R. Owen commented on SPARK-39302: -- I don't understand this - you didn't show a query with a group by? > Spark SQL - wrong field selection in group by > - > > Key: SPARK-39302 > URL: https://issues.apache.org/jira/browse/SPARK-39302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: shaharl...@gmail.com >Priority: Minor > > SQL parser select wrong field to group by. > See following example. > Sample data: > > {code:java} > spark.sql(""" > select "US" as pv_countryCode > union all > select "IL" as pv_countryCode > """).createOrReplaceTempView("my_test_data") > spark.sql(""" > select -3 as id, "US" as countryCode > """).createOrReplaceTempView("country_codes_sample") > {code} > code: > {code:java} > spark.sql(""" > select "US" as pv_countryCode > union all > select "IL" as pv_countryCode > """).createOrReplaceTempView("my_test_data") > spark.sql(""" > select -3 as id, "US" as countryCode > """).createOrReplaceTempView("country_codes_sample") > {code} > Error: > {code:java} > org.apache.spark.sql.AnalysisException: expression 'cct.`id`' is neither > present in the group by, nor is it an aggregate function. Add to group by or > wrap in first() (or first_value) if you don't care which value you get.; > Aggregate [country_id#910], [coalesce(id#886, -3) AS country_id#908, count(1) > AS count#909L] > +- Join LeftOuter, (countryCode#887 = country_id#910) >:- SubqueryAlias BASE >: +- Project [pv_countryCode#883 AS country_id#910] >: +- SubqueryAlias my_test_data >:+- Union false, false >: :- Project [US AS pv_countryCode#883] >: : +- OneRowRelation >: +- Project [IL AS pv_countryCode#884] >: +- OneRowRelation >+- SubqueryAlias cct > +- SubqueryAlias country_codes_sample > +- Project [-3 AS id#886, US AS countryCode#887] > +- OneRowRelation {code} > I expected Spark to choose the selected country_id(country_id#908) instead of > country_id#910 > Or at least throw ambiguous exception when grouping by `country_id`. > This lead developers to add `cct.id` into group by which results in > unexpected results. > (In case country_id has both null and -3 values) > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org