[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957731#comment-14957731 ]

Alexis Seigneurin commented on SPARK-10925:
-------------------------------------------

Well, technically, it's not a duplicate column. An inner join between two 
DataFrames on a column that carries the same name on both sides is supposed to 
work and to retain only one copy of that column.
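
For example (a minimal sketch with invented data and column names, runnable in 
a 1.5 spark-shell where sqlContext is predefined; the String-based join variant 
is the one that collapses the key into a single column):

{code}
// The column-name variant of join() should keep a single "name" column.
val left  = sqlContext.createDataFrame(Seq((1, "alice"), (2, "bob"))).toDF("id", "name")
val right = sqlContext.createDataFrame(Seq(("alice", 30), ("bob", 25))).toDF("name", "score")

// Expected schema: one "name" column plus "id" and "score" -- not two "name" columns.
left.join(right, "name").printSchema()
{code}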

I had noticed that renaming one of the columns works around the issue, and 
that's what I'm doing until it gets fixed.
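
Concretely, the workaround looks like this (invented data; "name_agg" is just 
a placeholder name):

{code}
val df = sqlContext.createDataFrame(Seq((1, "alice"), (2, "bob"), (3, "alice"))).toDF("id", "name")

// Rename the grouping key on the aggregated side so the join condition
// references two distinct attributes, then drop the duplicate column.
val counts = df.groupBy("name").count().withColumnRenamed("name", "name_agg")
val rejoined = df.join(counts, df("name") === counts("name_agg")).drop("name_agg")
rejoined.show()
{code}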

One thing to note, though, is that this code used to work with Spark 1.4 (I 
only adjusted the calls to the UDFs to use the new API). This means there must 
be a regression in the query analyzer.

> Exception when joining DataFrames
> ---------------------------------
>
>                 Key: SPARK-10925
>                 URL: https://issues.apache.org/jira/browse/SPARK-10925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>         Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>            Reporter: Alexis Seigneurin
>         Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply a UDF on column "name"
> # apply a UDF on column "surname"
> # apply a UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove any one of these steps, the process completes normally. A condensed sketch of the pipeline follows the list.
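> Here is such a sketch (UDF bodies and data are invented for illustration; column names match the list above; runnable in a 1.5 spark-shell where sqlContext is predefined):
> {code}
> import org.apache.spark.sql.functions._
>
> // Stand-in for the real cleaning UDFs (steps 2-4), defined via the new 1.5 API.
> val clean = udf((s: String) => s.trim.toLowerCase)
>
> val df = sqlContext.createDataFrame(Seq(
>     ("1", "Jon",  "Doe", "1980-01-01"),
>     ("2", "Jane", "Doe", "1981-02-02")))
>   .toDF("id", "name", "surname", "birthDate")
>   .withColumn("name", clean(col("name")))
>   .withColumn("surname", clean(col("surname")))
>   .withColumn("birthDate", clean(col("birthDate")))
>
> // Step 5: aggregate on "name" and re-join with the DF.
> val byName = df.groupBy("name").count().withColumnRenamed("count", "nameCount")
> val step5 = df.join(byName, "name")
>
> // Step 6: aggregate on "surname" and re-join -- the second re-join is the
> // step that fails with the AnalysisException below.
> val bySurname = step5.groupBy("surname").count().withColumnRenamed("count", "surnameCount")
> val step6 = step5.join(bySurname, "surname")
> step6.show()
> {code}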
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS birthDate_cleaned#8];
>       at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>       at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>       at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>       at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>       at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>       at scala.collection.immutable.List.foreach(List.scala:318)
>       at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>       at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>       at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>       at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>       at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>       at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>       at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>       at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>       at TestCase2$.main(TestCase2.scala:51)
>       at TestCase2.main(TestCase2.scala)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> {code}
> I'm attaching a test case that I tried with Spark 1.5.0 and 1.5.1. Please 
> note that it used to work with version 1.4.1.


