[jira] [Commented] (SPARK-11231) join returns schema with duplicated and ambiguous join columns

Shivaram Venkataraman (JIRA) Wed, 21 Oct 2015 09:07:09 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967364#comment-14967364
 ]


Shivaram Venkataraman commented on SPARK-11231:
-----------------------------------------------

[~Narine] [~sunrui] Is this covered by 
https://github.com/apache/spark/pull/9012 or does this require some changes on 
the scala side ?

cc [~davies]

> join returns schema with duplicated and ambiguous join columns
> --------------------------------------------------------------
>
>                 Key: SPARK-11231
>                 URL: https://issues.apache.org/jira/browse/SPARK-11231
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.5.1
>         Environment: R
>            Reporter: Matt Pollock
>
> In the case where the key column of two data frames are named the same thing, 
> join returns a data frame where that column is duplicated. Since the content 
> of the columns is guaranteed to be the same by row consolidating the 
> identical columns into a single column would replicate standard R behavior[1] 
> and help prevent ambiguous names.
> Example:
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > sdf1 <- createDataFrame(sqlContext, df1)
> > sdf2 <- createDataFrame(sqlContext, df2)
> > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> > schema(sjdf)
> StructType
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value1", type = "DoubleType", nullable = TRUE
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value2", type = "DoubleType", nullable = TRUE
> {code}
> The duplicated key columns cause things like:
> {code}
> > library(magrittr)
> > sjdf %>% select("key")
> 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could 
> be: key#125, key#127.;
>       at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
>       at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>       at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
>       at org.apache.spark.sql.catalyst.tree
> {code}
> [1] In base R there is no"join", but a similar function "merge" is provided 
> in which a "by" argument identifies the shared key column in the two data 
> frames. In the case where the key column names differ "by.x" and "by.y" 
> arguments can be used. In the case of same-named key columns the 
> consolidation behavior requested above is observed. In the case of differing 
> names they "by.x" name is retained and consolidated with the "by.y" column 
> which is dropped.
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > merge(df1, df2, by="key")
>   key value1 value2
> 1   A      1      4
> 2   B      2      5
> 3   C      3      6
> df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> > merge(df2, df3, by.x="key", by.y="akey")
>   key value2 value1
> 1   A      4      1
> 2   B      5      2
> 3   C      6      3
> > merge(df3, df2, by.x="akey", by.y="key")
>   akey value1 value2
> 1    A      1      4
> 2    B      2      5
> 3    C      3      6
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-11231) join returns schema with duplicated and ambiguous join columns

Reply via email to