[ https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041434#comment-16041434 ]
Takeshi Yamamuro edited comment on SPARK-17237 at 6/7/17 7:04 PM:
------------------------------------------------------------------

Thanks for the report. I think there are two points you pointed out: a qualifier and back-ticks. Yes, you're right: it seems my PR above wrongly dropped the qualifier from aggregated column names (which changed the behaviour).

{code}
// Spark-v2.1
scala> Seq((1, 2)).toDF("id", "v1").createOrReplaceTempView("s")
scala> Seq((1, 2)).toDF("id", "v2").createOrReplaceTempView("t")

scala> val df1 = sql("SELECT * FROM s")
df1: org.apache.spark.sql.DataFrame = [id: int, v1: int]

scala> val df2 = sql("SELECT * FROM t")
df2: org.apache.spark.sql.DataFrame = [id: int, v2: int]

scala> df1.join(df2, "id" :: Nil).groupBy("id").pivot("id").max("v1", "v2").show
+---+-------------+-------------+
| id|1_max(s.`v1`)|1_max(t.`v2`)|
+---+-------------+-------------+
|  1|            2|            2|
+---+-------------+-------------+

// Master
scala> df1.join(df2, "id" :: Nil).groupBy("id").pivot("id").max("v1", "v2").show
+---+---------+---------+
| id|1_max(v1)|1_max(v2)|
+---+---------+---------+
|  1|        2|        2|
+---+---------+---------+
{code}

We could easily fix this, but I'm not 100% sure that we need to. WDYT? cc: [~smilegator]

{code}
// Master with a patch (https://github.com/apache/spark/compare/master...maropu:SPARK-17237-4)
scala> df1.join(df2, "id" :: Nil).groupBy("id").pivot("id").max("v1", "v2").show
+---+-----------+-----------+
| id|1_max(s.v1)|1_max(t.v2)|
+---+-----------+-----------+
|  1|          2|          2|
+---+-----------+-----------+
{code}

On the other hand, IIUC, back-ticks are not allowed in column names because they have a special meaning in Spark.
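As a side note, if a user does end up with generated names containing back-ticks or parentheses, one workaround (a minimal sketch of my own, not part of the patch above) is to rename the pivot output columns before touching them with resolution-based APIs such as na.fill:

{code}
// Sketch only: strip the characters that the attribute-name parser treats
// specially ("`", "(", ")") from every pivot-generated column name, then
// fill as usual. Reuses df1/df2 from the shell session above.
val pivoted = df1.join(df2, "id" :: Nil).groupBy("id").pivot("id").max("v1", "v2")
val sanitized = pivoted.columns.foldLeft(pivoted) { (df, name) =>
  df.withColumnRenamed(name, name.replaceAll("[`()]", ""))
}
sanitized.na.fill(0).show
{code}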
> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17237
>                 URL: https://issues.apache.org/jira/browse/SPARK-17237
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jiang Qiqi
>            Assignee: Takeshi Yamamuro
>              Labels: newbie
>             Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> I am trying to run a pivot transformation that I ran on a spark1.6 cluster, namely:
>
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
>
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): double, 4_count(c): bigint, 4_avg(c): double]
>
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+----------+--------+----------+--------+
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+----------+--------+----------+--------+
> |  2|         1|     4.0|         0|     0.0|
> |  3|         0|     0.0|         1|     5.0|
> +---+----------+--------+----------+--------+
>
> After upgrading the environment to spark2.0, I got an error while executing the .na.fill method:
>
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
>
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_count(`c`)`;
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
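A possible workaround on the affected 2.0.x releases (a hedged sketch of mine, not verified against 2.0.0) is to alias each aggregate so the pivot-generated names never contain parentheses in the first place:

{code}
// Aliased aggregates produce names like "3_cnt" and "3_avg" instead of
// "3_count(`c`)", which na.fill can then resolve without a parse error.
import org.apache.spark.sql.functions.{avg, count}

val df = sc.parallelize(Seq((2, 3, 4), (3, 4, 5))).toDF("a", "b", "c")
df.groupBy("a").pivot("b")
  .agg(count("c").as("cnt"), avg("c").as("avg"))
  .na.fill(0)
  .show
{code}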