[ https://issues.apache.org/jira/browse/SPARK-42199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-42199:
-----------------------------------
    Labels: pull-request-available  (was: )

> groupByKey creates columns that may conflict with existing columns
> ------------------------------------------------------------------
>
>                 Key: SPARK-42199
>                 URL: https://issues.apache.org/jira/browse/SPARK-42199
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.3, 3.1.3, 3.2.3, 3.3.2, 3.4.0, 3.5.0
>            Reporter: Enrico Minack
>            Priority: Major
>              Labels: pull-request-available
>
> Calling {{ds.groupByKey(func: V => K)}} creates columns to store the key
> value. These columns may conflict with columns that already exist in {{ds}}.
> Function {{Dataset.groupByKey.agg}} accounts for this with a very specific
> rule, which has some surprising weaknesses:
> {code:scala}
> spark.range(1)
>   // groupByKey adds column 'value'
>   .groupByKey(id => id)
>   // which cannot be referenced, though it is suggested
>   .agg(count("value"))
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Column 'value' does not exist. Did you mean one of the following? [value, id];
> {code}
> An existing 'value' column can be referenced:
> {code:scala}
> // dataset with column 'value'
> spark.range(1).select($"id".as("value")).as[Long]
>   // groupByKey adds another column 'value'
>   .groupByKey(id => id)
>   // agg accounts for the extra column and excludes it when resolving 'value'
>   .agg(count("value"))
>   .show()
> {code}
> {code:java}
> +---+------------+
> |key|count(value)|
> +---+------------+
> |  0|           1|
> +---+------------+
> {code}
> The column suggestion, though, lists both 'value' columns:
> {code:scala}
> spark.range(1).select($"id".as("value")).as[Long]
>   .groupByKey(id => id)
>   .agg(count("unknown"))
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Column 'unknown' does not exist. Did you mean one of the following? [value, value]
> {code}
> Moreover, {{mapValues}} introduces another 'value' column, which should be
> referenceable, but it breaks the exclusion introduced by {{agg}}:
> {code:scala}
> spark.range(1)
>   // groupByKey adds column 'value'
>   .groupByKey(id => id)
>   // adds another 'value' column
>   .mapValues(value => value)
>   // which cannot be referenced in agg
>   .agg(count("value"))
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Reference 'value' is ambiguous, could be: value, value.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)