[ https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Monica Liu updated SPARK-10981:
-------------------------------
Description:
I am using SparkR from RStudio, and I ran into an error with the join function that I recreated with a smaller example:

{code:title=joinTest.R|borderStyle=solid}
Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

sc <- sparkR.init("local[4]")
sqlContext <- sparkRSQL.init(sc)

n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)
df1 = createDataFrame(sqlContext, df)
showDF(df1)

x = c(2, 3, 10)
t = c("dd", "ee", "ff")
c = c(FALSE, FALSE, TRUE)
dff = data.frame(x, t, c)
df2 = createDataFrame(sqlContext, dff)
showDF(df2)

res = join(df1, df2, df1$n == df2$x, "semijoin")
showDF(res)
{code}

Running this code, I encountered the error:
{panel}
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. Supported join types include: 'inner', 'outer', 'full', 'fullouter', 'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
{panel}

However, if I changed the joinType to "leftsemi",
{code}
res = join(df1, df2, df1$n == df2$x, "leftsemi")
{code}
I got the error:
{panel}
Error in .local(x, y, ...) :
  joinType must be one of the following types: 'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'
{panel}

Since the join function in R invokes a Java method, I went into DataFrame.R and changed "semijoin" to "leftsemi" on lines 1374 and 1378 so that the value passed through matches the Java method's accepted parameters. This also makes the accepted joinType values in R match Scala's.
semijoin:
{code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "semijoin")) {
  sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
} else {
  stop("joinType must be one of the following types: ",
       "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
}
{code}

leftsemi:
{code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "leftsemi")) {
  sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
} else {
  stop("joinType must be one of the following types: ",
       "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
}
{code}

This fixed the issue. I'm not sure whether this change breaks Hive compatibility or causes other problems, but I can submit a pull request for it.
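Until the R-side validation and the JVM-side join-type names agree, one possible workaround (a sketch, untested, assuming the sqlContext, df1, and df2 from joinTest.R above) is to skip SparkR's joinType check entirely and run the semi join through Spark SQL, which accepts LEFT SEMI JOIN directly:

```r
# Workaround sketch: register both DataFrames as temp tables and issue the
# semi join as a SQL query, bypassing the R-side joinType whitelist.
# Assumes sqlContext, df1, and df2 exist as in joinTest.R above.
registerTempTable(df1, "t1")
registerTempTable(df2, "t2")

# LEFT SEMI JOIN keeps only the rows of t1 that have a match in t2,
# and returns only t1's columns.
res <- sql(sqlContext, "SELECT * FROM t1 LEFT SEMI JOIN t2 ON t1.n = t2.x")
showDF(res)
```

This avoids patching DataFrame.R locally, at the cost of naming the join condition in SQL rather than as a Column expression.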
> R semijoin leads to Java errors, R leftsemi leads to Spark errors
> -----------------------------------------------------------------
>
>                 Key: SPARK-10981
>                 URL: https://issues.apache.org/jira/browse/SPARK-10981
>             Project: Spark
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.5.0
>         Environment: SparkR from RStudio on Macbook
>            Reporter: Monica Liu
>            Priority: Minor
>              Labels: easyfix, newbie