[
https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-11231.
----------------------------------
Resolution: Incomplete
> join returns schema with duplicated and ambiguous join columns
> --------------------------------------------------------------
>
> Key: SPARK-11231
> URL: https://issues.apache.org/jira/browse/SPARK-11231
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.5.1
> Environment: R
> Reporter: Matt Pollock
> Priority: Major
> Labels: bulk-closed
>
> In the case where the key column of two data frames are named the same thing,
> join returns a data frame where that column is duplicated. Since the content
> of the columns is guaranteed to be the same by row consolidating the
> identical columns into a single column would replicate standard R behavior[1]
> and help prevent ambiguous names.
> Example:
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > sdf1 <- createDataFrame(sqlContext, df1)
> > sdf2 <- createDataFrame(sqlContext, df2)
> > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> > schema(sjdf)
> StructType
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value1", type = "DoubleType", nullable = TRUE
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value2", type = "DoubleType", nullable = TRUE
> {code}
> The duplicated key columns cause things like:
> {code}
> > library(magrittr)
> > sjdf %>% select("key")
> 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
> org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could
> be: key#125, key#127.;
> at
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
> at
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
> at
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
> at org.apache.spark.sql.catalyst.tree
> {code}
> [1] In base R there is no"join", but a similar function "merge" is provided
> in which a "by" argument identifies the shared key column in the two data
> frames. In the case where the key column names differ "by.x" and "by.y"
> arguments can be used. In the case of same-named key columns the
> consolidation behavior requested above is observed. In the case of differing
> names they "by.x" name is retained and consolidated with the "by.y" column
> which is dropped.
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > merge(df1, df2, by="key")
> key value1 value2
> 1 A 1 4
> 2 B 2 5
> 3 C 3 6
> df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> > merge(df2, df3, by.x="key", by.y="akey")
> key value2 value1
> 1 A 4 1
> 2 B 5 2
> 3 C 6 3
> > merge(df3, df2, by.x="akey", by.y="key")
> akey value1 value2
> 1 A 1 4
> 2 B 2 5
> 3 C 3 6
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]