[ https://issues.apache.org/jira/browse/SPARK-39302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
shaharl...@gmail.com updated SPARK-39302: ----------------------------------------- Description: SQL parser select wrong field to group by. See following example. Sample data: {code:java} spark.sql(""" select "US" as pv_countryCode union all select "IL" as pv_countryCode """).createOrReplaceTempView("my_test_data") spark.sql(""" select -3 as id, "US" as countryCode """).createOrReplaceTempView("country_codes_sample") {code} code: {code:java} spark.sql(""" WITH BASE (SELECT pv_countryCode AS country_id FROM my_test_data) SELECT COALESCE(cct.id, -3) as country_id, count(1) as count FROM BASE LEFT JOIN country_codes_sample cct ON cct.countryCode = country_id GROUP BY country_id """).createOrReplaceTempView("campaign_country_bug") {code} Error: {code:java} org.apache.spark.sql.AnalysisException: expression 'cct.`id`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.; Aggregate [country_id#910], [coalesce(id#886, -3) AS country_id#908, count(1) AS count#909L] +- Join LeftOuter, (countryCode#887 = country_id#910) :- SubqueryAlias BASE : +- Project [pv_countryCode#883 AS country_id#910] : +- SubqueryAlias my_test_data : +- Union false, false : :- Project [US AS pv_countryCode#883] : : +- OneRowRelation : +- Project [IL AS pv_countryCode#884] : +- OneRowRelation +- SubqueryAlias cct +- SubqueryAlias country_codes_sample +- Project [-3 AS id#886, US AS countryCode#887] +- OneRowRelation {code} I expected Spark to choose the selected country_id(country_id#908) instead of country_id#910 Or at least throw ambiguous exception when grouping by `country_id`. This lead developers to add `cct.id` into group by which results in unexpected results. (In case country_id has both null and -3 values) was: SQL parser select wrong field to group by. See following example. Sample data: {code:java} spark.sql(""" select "US" as pv_countryCode union all select "IL" as pv_countryCode """).createOrReplaceTempView("my_test_data") spark.sql(""" select -3 as id, "US" as countryCode """).createOrReplaceTempView("country_codes_sample") {code} code: {code:java} spark.sql(""" select "US" as pv_countryCode union all select "IL" as pv_countryCode """).createOrReplaceTempView("my_test_data") spark.sql(""" select -3 as id, "US" as countryCode """).createOrReplaceTempView("country_codes_sample") {code} Error: {code:java} org.apache.spark.sql.AnalysisException: expression 'cct.`id`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.; Aggregate [country_id#910], [coalesce(id#886, -3) AS country_id#908, count(1) AS count#909L] +- Join LeftOuter, (countryCode#887 = country_id#910) :- SubqueryAlias BASE : +- Project [pv_countryCode#883 AS country_id#910] : +- SubqueryAlias my_test_data : +- Union false, false : :- Project [US AS pv_countryCode#883] : : +- OneRowRelation : +- Project [IL AS pv_countryCode#884] : +- OneRowRelation +- SubqueryAlias cct +- SubqueryAlias country_codes_sample +- Project [-3 AS id#886, US AS countryCode#887] +- OneRowRelation {code} I expected Spark to choose the selected country_id(country_id#908) instead of country_id#910 Or at least throw ambiguous exception when grouping by `country_id`. This lead developers to add `cct.id` into group by which results in unexpected results. (In case country_id has both null and -3 values) > Spark SQL - wrong field selection in group by > --------------------------------------------- > > Key: SPARK-39302 > URL: https://issues.apache.org/jira/browse/SPARK-39302 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.1.2 > Reporter: shaharl...@gmail.com > Priority: Minor > > SQL parser select wrong field to group by. > See following example. > Sample data: > > {code:java} > spark.sql(""" > select "US" as pv_countryCode > union all > select "IL" as pv_countryCode > """).createOrReplaceTempView("my_test_data") > spark.sql(""" > select -3 as id, "US" as countryCode > """).createOrReplaceTempView("country_codes_sample") > {code} > code: > {code:java} > > spark.sql(""" > WITH BASE > (SELECT > pv_countryCode AS country_id > FROM my_test_data) > SELECT > COALESCE(cct.id, -3) as country_id, > count(1) as count > FROM BASE > LEFT JOIN country_codes_sample cct ON cct.countryCode = > country_id > GROUP BY > country_id > """).createOrReplaceTempView("campaign_country_bug") {code} > Error: > {code:java} > org.apache.spark.sql.AnalysisException: expression 'cct.`id`' is neither > present in the group by, nor is it an aggregate function. Add to group by or > wrap in first() (or first_value) if you don't care which value you get.; > Aggregate [country_id#910], [coalesce(id#886, -3) AS country_id#908, count(1) > AS count#909L] > +- Join LeftOuter, (countryCode#887 = country_id#910) > :- SubqueryAlias BASE > : +- Project [pv_countryCode#883 AS country_id#910] > : +- SubqueryAlias my_test_data > : +- Union false, false > : :- Project [US AS pv_countryCode#883] > : : +- OneRowRelation > : +- Project [IL AS pv_countryCode#884] > : +- OneRowRelation > +- SubqueryAlias cct > +- SubqueryAlias country_codes_sample > +- Project [-3 AS id#886, US AS countryCode#887] > +- OneRowRelation {code} > I expected Spark to choose the selected country_id(country_id#908) instead of > country_id#910 > Or at least throw ambiguous exception when grouping by `country_id`. > This lead developers to add `cct.id` into group by which results in > unexpected results. > (In case country_id has both null and -3 values) > -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org