[ https://issues.apache.org/jira/browse/SPARK-39302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554015#comment-17554015 ]
Sean R. Owen commented on SPARK-39302: -------------------------------------- The error seems correct. You group the result of the join by country_id, but in the join that definitely resolves to country_id 910, as there is no other country_id. The projection happens after, which creates the second country_id. Group the result of the join after the projection. > Spark SQL - wrong field selection in group by > --------------------------------------------- > > Key: SPARK-39302 > URL: https://issues.apache.org/jira/browse/SPARK-39302 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.1.2 > Reporter: shaharl...@gmail.com > Priority: Minor > > SQL parser select wrong field to group by. > See following example. > Sample data: > > {code:java} > spark.sql(""" > select "US" as pv_countryCode > union all > select "IL" as pv_countryCode > """).createOrReplaceTempView("my_test_data") > spark.sql(""" > select -3 as id, "US" as countryCode > """).createOrReplaceTempView("country_codes_sample") > {code} > code: > {code:java} > > spark.sql(""" > WITH BASE > (SELECT > pv_countryCode AS country_id > FROM my_test_data) > SELECT > COALESCE(cct.id, -3) as country_id, > count(1) as count > FROM BASE > LEFT JOIN country_codes_sample cct ON cct.countryCode = > country_id > GROUP BY > country_id > """).createOrReplaceTempView("campaign_country_bug") {code} > Error: > {code:java} > org.apache.spark.sql.AnalysisException: expression 'cct.`id`' is neither > present in the group by, nor is it an aggregate function. Add to group by or > wrap in first() (or first_value) if you don't care which value you get.; > Aggregate [country_id#910], [coalesce(id#886, -3) AS country_id#908, count(1) > AS count#909L] > +- Join LeftOuter, (countryCode#887 = country_id#910) > :- SubqueryAlias BASE > : +- Project [pv_countryCode#883 AS country_id#910] > : +- SubqueryAlias my_test_data > : +- Union false, false > : :- Project [US AS pv_countryCode#883] > : : +- OneRowRelation > : +- Project [IL AS pv_countryCode#884] > : +- OneRowRelation > +- SubqueryAlias cct > +- SubqueryAlias country_codes_sample > +- Project [-3 AS id#886, US AS countryCode#887] > +- OneRowRelation {code} > I expected Spark to choose the selected country_id(country_id#908) instead of > country_id#910 > Or at least throw ambiguous exception when grouping by `country_id`. > This lead developers to add `cct.id` into group by which results in > unexpected results. > (In case country_id has both null and -3 values) > -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org