[ https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968608#comment-14968608 ]
Sun Rui commented on SPARK-11231: --------------------------------- [~shivaram] This feature can be done on R side even though it is supported on Scala side. But if Scala side can support this, R implementation can be simplified. > join returns schema with duplicated and ambiguous join columns > -------------------------------------------------------------- > > Key: SPARK-11231 > URL: https://issues.apache.org/jira/browse/SPARK-11231 > Project: Spark > Issue Type: Bug > Components: SparkR > Affects Versions: 1.5.1 > Environment: R > Reporter: Matt Pollock > > In the case where the key column of two data frames are named the same thing, > join returns a data frame where that column is duplicated. Since the content > of the columns is guaranteed to be the same by row consolidating the > identical columns into a single column would replicate standard R behavior[1] > and help prevent ambiguous names. > Example: > {code} > > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3)) > > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6)) > > sdf1 <- createDataFrame(sqlContext, df1) > > sdf2 <- createDataFrame(sqlContext, df2) > > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner") > > schema(sjdf) > StructType > |-name = "key", type = "StringType", nullable = TRUE > |-name = "value1", type = "DoubleType", nullable = TRUE > |-name = "key", type = "StringType", nullable = TRUE > |-name = "value2", type = "DoubleType", nullable = TRUE > {code} > The duplicated key columns cause things like: > {code} > > library(magrittr) > > sjdf %>% select("key") > 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed > Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : > org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could > be: key#125, key#127.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399) > at org.apache.spark.sql.catalyst.tree > {code} > [1] In base R there is no"join", but a similar function "merge" is provided > in which a "by" argument identifies the shared key column in the two data > frames. In the case where the key column names differ "by.x" and "by.y" > arguments can be used. In the case of same-named key columns the > consolidation behavior requested above is observed. In the case of differing > names they "by.x" name is retained and consolidated with the "by.y" column > which is dropped. > {code} > > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3)) > > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6)) > > merge(df1, df2, by="key") > key value1 value2 > 1 A 1 4 > 2 B 2 5 > 3 C 3 6 > df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3)) > > merge(df2, df3, by.x="key", by.y="akey") > key value2 value1 > 1 A 4 1 > 2 B 5 2 > 3 C 6 3 > > merge(df3, df2, by.x="akey", by.y="key") > akey value1 value2 > 1 A 1 4 > 2 B 2 5 > 3 C 3 6 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org