[ https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041434#comment-16041434 ]
Takeshi Yamamuro edited comment on SPARK-17237 at 6/7/17 7:04 PM:
------------------------------------------------------------------

Thanks for the report. I think there are two points you pointed out: a qualifier and back-ticks. Yes, you're right: it seems my PR above wrongly dropped the qualifier from aggregated column names (which changed the behaviour).

{code}
// Spark-v2.1
scala> Seq((1, 2)).toDF("id", "v1").createOrReplaceTempView("s")
scala> Seq((1, 2)).toDF("id", "v2").createOrReplaceTempView("t")

scala> val df1 = sql("SELECT * FROM s")
df1: org.apache.spark.sql.DataFrame = [id: int, v1: int]

scala> val df2 = sql("SELECT * FROM t")
df2: org.apache.spark.sql.DataFrame = [id: int, v2: int]

scala> df1.join(df2, "id" :: Nil).groupBy("id").pivot("id").max("v1", "v2").show
+---+-------------+-------------+
| id|1_max(s.`v1`)|1_max(t.`v2`)|
+---+-------------+-------------+
|  1|            2|            2|
+---+-------------+-------------+

// Master
scala> df1.join(df2, "id" :: Nil).groupBy("id").pivot("id").max("v1", "v2").show
+---+---------+---------+
| id|1_max(v1)|1_max(v2)|
+---+---------+---------+
|  1|        2|        2|
+---+---------+---------+
{code}

We could easily fix this, but I'm not 100% sure that we need to. WDYT? cc: [~smilegator]

{code}
// Master with a patch (https://github.com/apache/spark/compare/master...maropu:SPARK-17237-4)
scala> df1.join(df2, "id" :: Nil).groupBy("id").pivot("id").max("v1", "v2").show
+---+-----------+-----------+
| id|1_max(s.v1)|1_max(t.v2)|
+---+-----------+-----------+
|  1|          2|          2|
+---+-----------+-----------+
{code}

On the other hand, IIUC, back-ticks are not allowed in column names because they have a special meaning in Spark.
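As a side note, if a user does end up with generated names containing back-ticks or parentheses, one workaround (a minimal sketch of my own, not part of the patch above) is to rename the pivot output columns before touching them with resolution-based APIs such as na.fill:

{code}
// Sketch only: strip the characters that the attribute-name parser treats
// specially ("`", "(", ")") from every pivot-generated column name, then
// fill as usual. Reuses df1/df2 from the shell session above.
val pivoted = df1.join(df2, "id" :: Nil).groupBy("id").pivot("id").max("v1", "v2")
val sanitized = pivoted.columns.foldLeft(pivoted) { (df, name) =>
  df.withColumnRenamed(name, name.replaceAll("[`()]", ""))
}
sanitized.na.fill(0).show
{code}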
> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17237
>                 URL: https://issues.apache.org/jira/browse/SPARK-17237
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jiang Qiqi
>            Assignee: Takeshi Yamamuro
>              Labels: newbie
>             Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> I am trying to run a pivot transformation that I ran on a spark1.6 cluster, namely:
>
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
>
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): double, 4_count(c): bigint, 4_avg(c): double]
>
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+----------+--------+----------+--------+
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+----------+--------+----------+--------+
> |  2|         1|     4.0|         0|     0.0|
> |  3|         0|     0.0|         1|     5.0|
> +---+----------+--------+----------+--------+
>
> After upgrading the environment to spark2.0, I got an error while executing the .na.fill method:
>
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
>
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_count(`c`)`;
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
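A possible workaround on the affected 2.0.x releases (a hedged sketch of mine, not verified against 2.0.0) is to alias each aggregate so the pivot-generated names never contain parentheses in the first place:

{code}
// Aliased aggregates produce names like "3_cnt" and "3_avg" instead of
// "3_count(`c`)", which na.fill can then resolve without a parse error.
import org.apache.spark.sql.functions.{avg, count}

val df = sc.parallelize(Seq((2, 3, 4), (3, 4, 5))).toDF("a", "b", "c")
df.groupBy("a").pivot("b")
  .agg(count("c").as("cnt"), avg("c").as("avg"))
  .na.fill(0)
  .show
{code}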