[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945835#comment-14945835 ]

Hossein Falaki commented on SPARK-9318:
---

I agree with the issue being discussed. SparkR should have called this signature of join:
{code}
def join(right: DataFrame, usingColumns: Seq[String]): DataFrame
{code}
This version of DataFrame.join makes sure only a single join column is returned. Right now the join (and merge) behavior in SparkR is not what R users expect.

> Add `merge` as synonym for join
> ---
>
> Key: SPARK-9318
> URL: https://issues.apache.org/jira/browse/SPARK-9318
> Project: Spark
> Issue Type: Sub-task
> Components: SparkR
> Reporter: Shivaram Venkataraman
> Assignee: Hossein Falaki
> Fix For: 1.5.0
>
>

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
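For reference, the behavior R users are used to can be seen with base R alone. The sketch below is not SparkR code; it only assumes the two data frames from the example quoted in this thread and shows that base R's merge returns the key column once and disambiguates the remaining name clashes with suffixes.

```r
# Data frames taken from the example quoted in this thread.
x <- data.frame(k1 = c(NA, NA, 3, 4, 5), k2 = c(1, NA, NA, 4, 5), data = 1:5)
y <- data.frame(k1 = c(NA, 2, NA, 4, 5), k2 = c(NA, NA, 3, 4, 5), data = 1:5)

# Merging on k1 keeps a single k1 column. The other shared names
# (k2, data) get the default suffixes ".x" and ".y".
res <- merge(x, y, by = "k1")

print(names(res))
print(nrow(res))  # NA keys match each other in base R's merge by default
```

A SparkR merge that calls the usingColumns variant of join would be expected to give a similar single-key-column shape, although that is an assumption about the eventual implementation rather than something this thread confirms.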
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945402#comment-14945402 ]

Shivaram Venkataraman commented on SPARK-9318:
--

[~Narine] Could you post general questions / issues with `join` to the user / dev mailing list? That way all the devs can respond to this.
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943784#comment-14943784 ]

Narine Kokhlikyan commented on SPARK-9318:
--

Hi all, [~shivaram], [~falaki],

I am working on the new signature for merge and have noticed that join in general has serious issues. I took one of the examples from R base:::merge - https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)

I want to join these two data frames:

res <- join(xdf,ydf)

res has the following structure:

DataFrame[k1:double, k2:double, data:int, k1:double, k2:double, data:int]

but when I do head(res) I get the following:

  k1 k2 data
1 NA NA    1
2  2 NA    2
3 NA  3    3
4  4  4    4
5  5  5    5
6 NA NA    1

This is not what I was expecting. The structure is inconsistent with the data I see with head.

I tried to alias the columns that have the same names in both data frames:

ydfsel <- select(ydf, alias(ydf$k1,"k1.y"), alias(ydf$k2,"k2.y"), alias(ydf$data,"data.y"))
xdfsel <- select(xdf, alias(xdf$k1,"k1.x"), alias(xdf$k2,"k2.x"), alias(xdf$data,"data.x"))

This works, and join(xdfsel, ydfsel) also works, but the following fails:

join(xdfsel,ydfsel,xdfsel$k1.x==ydfsel$k1.y)

Does this mean that I cannot refer to an aliased column? Do you know what the issue is here?

Thanks,
Narine
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943810#comment-14943810 ]

Deborah Siegel commented on SPARK-9318:
---

Narine, just want to offer that I haven't replicated that problem.

x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)
xdf <- createDataFrame(sqlContext, x)
ydf <- createDataFrame(sqlContext, y)
res <- join(xdf,ydf)
head(res)

  k1 k2 data k1 k2 data
1 NA  1    1 NA NA    1
2 NA  1    1  2 NA    2
3 NA  1    1 NA  3    3
4 NA  1    1  4  4    4
5 NA  1    1  5  5    5
6 NA NA    2 NA NA    1

> printSchema(res)
root
 |-- k1: double (nullable = true)
 |-- k2: double (nullable = true)
 |-- data: integer (nullable = true)
 |-- k1: double (nullable = true)
 |-- k2: double (nullable = true)
 |-- data: integer (nullable = true)
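The six rows in the head(res) output above pair the first row of x with every row of y, which is what a Cartesian product looks like. As a point of comparison only, base R's merge also returns a Cartesian product when no key columns are given. The sketch below is plain base R, not SparkR, and reuses the data frames from this thread; it makes no claim about the row order Spark returns.

```r
# Data frames taken from the example quoted in this thread.
x <- data.frame(k1 = c(NA, NA, 3, 4, 5), k2 = c(1, NA, NA, 4, 5), data = 1:5)
y <- data.frame(k1 = c(NA, 2, NA, 4, 5), k2 = c(NA, NA, 3, 4, 5), data = 1:5)

# With by = NULL there is nothing to match on, so every row of x is
# paired with every row of y: nrow(x) * nrow(y) rows and
# ncol(x) + ncol(y) columns.
cross <- merge(x, y, by = NULL)

print(dim(cross))
```

Seeing 5 * 5 rows with all six columns is therefore the expected shape for a join without a join expression, and the three-column output reported earlier in the thread is the surprising one.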
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943988#comment-14943988 ]

Narine Kokhlikyan commented on SPARK-9318:
--

printSchema is showing up correctly for me too. Only the head function returns an unexpected result.
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943986#comment-14943986 ]

Narine Kokhlikyan commented on SPARK-9318:
--

Hi [~dsiegel], thanks for checking it. Was there a recent fix related to that? Also, have you tried the aliases? Is it working for you?
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944007#comment-14944007 ]

Deborah Siegel commented on SPARK-9318:
---

Not sure about the fix. I tried this on 1.5.0 and 1.5.1, same results.

Regarding the alias column, the issue is that "." in the schema is being converted to "_" behind the scenes. This happens automatically when createDataFrame is used. With alias, however, the name does not seem to be converted, while the select is still looking for the converted name.

This works:

ydfsel <- select(ydf, alias(ydf$k1,"k1_y"), alias(ydf$k2,"k2_y"), alias(ydf$data,"data_y"))
xdfsel <- select(xdf, alias(xdf$k1,"k1_x"), alias(xdf$k2,"k2_x"), alias(xdf$data,"data_x"))
res3 <- join(xdfsel,ydfsel,xdfsel$k1_x==ydfsel$k1_y)
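Following the workaround above, one option is to normalize alias names before passing them to alias, so that they never contain ".". The helper below is hypothetical (sanitize_names is not part of SparkR) and uses base R only; it is a sketch of the renaming step, not of how SparkR converts names internally.

```r
# Hypothetical helper, not a SparkR function: replace every "." in a
# column name with "_". fixed = TRUE makes gsub treat "." as a literal
# dot rather than as a regular-expression wildcard.
sanitize_names <- function(nms) {
  gsub(".", "_", nms, fixed = TRUE)
}

wanted <- c("k1.x", "k2.x", "data.x")
safe <- sanitize_names(wanted)

print(safe)
```

The sanitized names could then be used as the alias strings, as in the k1_x / k1_y example above.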
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944056#comment-14944056 ]

Narine Kokhlikyan commented on SPARK-9318:
--

I asked other people to try this and they all see:

  k1 k2 data
1 NA NA    1
2  2 NA    2
3 NA  3    3
4  4  4    4
5  5  5    5
6 NA NA    1

We just run:

x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)
xdf <- createDataFrame(sqlContext, x)
ydf <- createDataFrame(sqlContext, y)
res <- join(xdf,ydf)
head(res)

Can anyone else try this? [~olarayej] [~shivaram]
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934484#comment-14934484 ]

Shivaram Venkataraman commented on SPARK-9318:
--

Sure. Feel free to send a PR and cc [~falaki] on it as well.
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934376#comment-14934376 ]

Narine Kokhlikyan commented on SPARK-9318:
--

Hi guys,

Can we reopen this issue, Shivaram Venkataraman? The reason is that merge should have the following signature:

merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, ...)

I'm working on this and will open a pull request soon.

Thanks,
Narine
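To make the defaults in that signature concrete, here is a small base R sketch. It is not SparkR code, and the data frames are the ones from the base R merge example quoted elsewhere in this thread. It shows the default by (all shared column names) and the effect of by, all.x and suffixes.

```r
# Data frames from the base R merge example quoted in this thread.
x <- data.frame(k1 = c(NA, NA, 3, 4, 5), k2 = c(1, NA, NA, 4, 5), data = 1:5)
y <- data.frame(k1 = c(NA, 2, NA, 4, 5), k2 = c(NA, NA, 3, 4, 5), data = 1:5)

# Default: by = intersect(names(x), names(y)), i.e. k1, k2 and data,
# and all = FALSE, so only rows equal on all three columns survive.
inner <- merge(x, y)

# Explicit keys, keep every row of x (all.x = TRUE), and use "_x"/"_y"
# instead of the default ".x"/".y" for the clashing data column.
left <- merge(x, y, by = c("k1", "k2"), all.x = TRUE,
              suffixes = c("_x", "_y"))

print(names(inner)); print(nrow(inner))
print(names(left));  print(nrow(left))
```

Choosing suffixes without "." here is only a nod to the naming issue discussed in this thread; base R itself is happy with the default ".x" and ".y".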
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648387#comment-14648387 ]

Apache Spark commented on SPARK-9318:
-

User 'falaki' has created a pull request for this issue:
https://github.com/apache/spark/pull/7806