[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340412#comment-16340412 ]
Henry Robinson commented on SPARK-23157:
----------------------------------------

I'm not sure this should actually be expected to work. {{Dataset.map()}} always returns a dataset whose logical plan differs from the original's, so {{ds.map(a => a).col("id")}} is an expression that refers to an attribute ID the original dataset doesn't produce. The requirement for {{ds.withColumn()}} is that the column argument be an expression over {{ds}}'s own logical plan. You get the same error from the following, which is more explicit about these being two separate datasets:

{code:java}
scala> val ds = spark.createDataset(Seq(R("1")))
ds: org.apache.spark.sql.Dataset[R] = [id: string]

scala> val ds2 = spark.createDataset(Seq(R("1")))
ds2: org.apache.spark.sql.Dataset[R] = [id: string]

scala> ds.withColumn("id2", ds2.col("id"))
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#113 missing from id#1 in operator !Project [id#1, id#113 AS id2#115]. Attribute(s) with the same name appear in the operation: id.
Please check if the right attribute(s) are used.;;
!Project [id#1, id#113 AS id2#115]
+- LocalRelation [id#1]
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:297)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3286)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1303)
  at org.apache.spark.sql.Dataset.withColumns(Dataset.scala:2185)
  at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:2152)
  ... 49 elided
{code}

If the {{map}} function weren't the identity, would you still expect this to work?
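For what it's worth, a sketch of a pattern that sidesteps the mismatched attribute IDs (not from the report above, just an illustration) is to bind the mapped dataset to a name and call {{withColumn}} on it directly, so the column expression resolves against that dataset's own logical plan:

{code:java}
scala> val mapped = ds.map(a => a)  // new logical plan, new attribute IDs
mapped: org.apache.spark.sql.Dataset[R] = [id: string]

scala> mapped.withColumn("id2", mapped.col("id"))  // column over mapped's own plan
res0: org.apache.spark.sql.DataFrame = [id: string, id2: string]
{code}

Here the analyzer succeeds because {{mapped.col("id")}} refers to an attribute that {{mapped}}'s plan actually produces, unlike a column taken from a different dataset.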
> withColumn fails for a column that is a result of mapped DataSet
> ----------------------------------------------------------------
>
>                 Key: SPARK-23157
>                 URL: https://issues.apache.org/jira/browse/SPARK-23157
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Tomasz Bartczak
>            Priority: Minor
>
> Having
> {code:java}
> case class R(id: String)
> val ds = spark.createDataset(Seq(R("1")))
> {code}
> This works:
> {code}
> scala> ds.withColumn("n", ds.col("id"))
> res16: org.apache.spark.sql.DataFrame = [id: string, n: string]
> {code}
> but when we map over ds it fails:
> {code}
> scala> ds.withColumn("n", ds.map(a => a).col("id"))
> org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing from id#4 in operator !Project [id#4, id#55 AS n#57];;
> !Project [id#4, id#55 AS n#57]
> +- LocalRelation [id#4]
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:1150)
>   at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905)
>   ... 48 elided
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)