[jira] [Commented] (SPARK-12526) `ifelse`, `when`, `otherwise` unable to take Column as value
[ https://issues.apache.org/jira/browse/SPARK-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122611#comment-15122611 ]

Deborah Siegel commented on SPARK-12526:
----------------------------------------

I'm trying to use ifelse (or when and otherwise) to replace some of the values in a column (or to create a new column which uses the old column for some of its values). I believe this falls into the case of wanting to do the above example:

ifelse(df$mpg > 0, df$mpg, 0)

Here's my example:

aq$mynewcolumn <- ifelse(aq$Ozone != "NA", aq$Ozone, "")

or

aq$mynewcolumn <- otherwise(when(aq$Ozone == "NA", ""), aq$Ozone)

Is it correct that this still won't work in 1.6.0? It fails with:

Error in rep(yes, length.out = length(ans)) :
  attempt to replicate an object of type 'environment'


> `ifelse`, `when`, `otherwise` unable to take Column as value
> ------------------------------------------------------------
>
>                 Key: SPARK-12526
>                 URL: https://issues.apache.org/jira/browse/SPARK-12526
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.5.2, 1.6.0
>            Reporter: Sen Fang
>            Assignee: Sen Fang
>             Fix For: 1.6.1, 2.0.0
>
>
> When passing a Column to {{ifelse}}, {{when}}, {{otherwise}}, it will error out with
> {code}
> attempt to replicate an object of type 'environment'
> {code}
> The problem lies in the use of the base R {{ifelse}} function, which is the vectorized version of the {{if ... else ...}} idiom, but which is unable to replicate a Column's jobj because it is an environment.
> Considering {{callJMethod}} was never designed to be vectorized, the safe option is to replace {{ifelse}} with {{if ... else ...}} instead. Technically, however, this is inconsistent with base R's {{ifelse}}, which is meant to be vectorized.
> I can send a PR for review first and discuss further whether there is any scenario at all in which `ifelse`, `when`, `otherwise` would be used in a vectorized way.
> A dummy example is:
> {code}
> ifelse(lit(1) == lit(1), lit(2), lit(3))
> {code}
> A concrete example might be:
> {code}
> ifelse(df$mpg > 0, df$mpg, 0)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
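[Editor's note] Until the fix lands, a non-vectorized workaround in the spirit of the ticket is to build the Column with SparkR's when/otherwise directly rather than going through base R's ifelse. A minimal sketch, assuming a running SparkR 1.5/1.6 session with a sqlContext, and assuming Ozone is numeric so its missing values arrive as SQL NULLs (tested with isNull rather than compared to the string "NA"); untested here, since it needs a live Spark backend:

```r
library(SparkR)

# Hypothetical sketch: airquality's Ozone NAs become SQL NULLs in the
# DataFrame; when/otherwise builds a Column without base R's ifelse,
# so nothing tries to replicate the Column's environment.
aq <- createDataFrame(sqlContext, airquality)
aq$Ozone2 <- otherwise(when(isNull(aq$Ozone), 0), aq$Ozone)
head(select(aq, "Ozone", "Ozone2"))
```

The column names and the 0 replacement value are illustrative only; the point is that when/otherwise operate on Column objects end to end.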
[jira] [Comment Edited] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944007#comment-14944007 ]

Deborah Siegel edited comment on SPARK-9318 at 10/5/15 9:00 PM:
---------------------------------------------------------------

not sure about the fix. I tried this on 1.5.0 and 1.5.1, same results.

regarding the alias column, the issue is that "." in the schema is being converted to "_" behind the scenes. This happens automatically when createDataFrame is used. But it seems that with alias, it is not being converted; however, the join is looking for the converted name.

this works:

ydfsel <- select(ydf, alias(ydf$k1, "k1_y"), alias(ydf$k2, "k2_y"), alias(ydf$data, "data_y"))
xdfsel <- select(xdf, alias(xdf$k1, "k1_x"), alias(xdf$k2, "k2_x"), alias(xdf$data, "data_x"))
res3 <- join(xdfsel, ydfsel, xdfsel$k1_x == ydfsel$k1_y)

was (Author: dsiegel):
not sure about the fix. I tried this on 1.5.0 and 1.5.1, same results.

regarding the alias column, the issue is that "." in the schema is being converted to "_" behind the scenes. This happens automatically when createDataFrame is used. But it seems that with alias, it is not being converted; however, the select is looking for the converted name.

this works:

ydfsel <- select(ydf, alias(ydf$k1, "k1_y"), alias(ydf$k2, "k2_y"), alias(ydf$data, "data_y"))
xdfsel <- select(xdf, alias(xdf$k1, "k1_x"), alias(xdf$k2, "k2_x"), alias(xdf$data, "data_x"))
res3 <- join(xdfsel, ydfsel, xdfsel$k1_x == ydfsel$k1_y)


> Add `merge` as synonym for join
> -------------------------------
>
>                 Key: SPARK-9318
>                 URL: https://issues.apache.org/jira/browse/SPARK-9318
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>            Reporter: Shivaram Venkataraman
>            Assignee: Hossein Falaki
>             Fix For: 1.5.0
>
[jira] [Comment Edited] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944007#comment-14944007 ]

Deborah Siegel edited comment on SPARK-9318 at 10/5/15 9:01 PM:
---------------------------------------------------------------

not sure about the fix. I tried this on 1.5.0 and 1.5.1, same results.

regarding the alias column, the issue is that "." in the schema is being converted to "_" behind the scenes. This happens automatically when createDataFrame is used. But it seems that with alias, it is not being converted; however, it seems like maybe the join is looking for the converted name.

this works:

ydfsel <- select(ydf, alias(ydf$k1, "k1_y"), alias(ydf$k2, "k2_y"), alias(ydf$data, "data_y"))
xdfsel <- select(xdf, alias(xdf$k1, "k1_x"), alias(xdf$k2, "k2_x"), alias(xdf$data, "data_x"))
res3 <- join(xdfsel, ydfsel, xdfsel$k1_x == ydfsel$k1_y)

was (Author: dsiegel):
not sure about the fix. I tried this on 1.5.0 and 1.5.1, same results.

regarding the alias column, the issue is that "." in the schema is being converted to "_" behind the scenes. This happens automatically when createDataFrame is used. But it seems that with alias, it is not being converted; however, the join is looking for the converted name.

this works:

ydfsel <- select(ydf, alias(ydf$k1, "k1_y"), alias(ydf$k2, "k2_y"), alias(ydf$data, "data_y"))
xdfsel <- select(xdf, alias(xdf$k1, "k1_x"), alias(xdf$k2, "k2_x"), alias(xdf$data, "data_x"))
res3 <- join(xdfsel, ydfsel, xdfsel$k1_x == ydfsel$k1_y)
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944007#comment-14944007 ]

Deborah Siegel commented on SPARK-9318:
---------------------------------------

not sure about the fix. I tried this on 1.5.0 and 1.5.1, same results.

regarding the alias column, the issue is that "." in the schema is being converted to "_" behind the scenes. This happens automatically when createDataFrame is used. But it seems that with alias, it is not being converted; however, the select is looking for the converted name.

this works:

ydfsel <- select(ydf, alias(ydf$k1, "k1_y"), alias(ydf$k2, "k2_y"), alias(ydf$data, "data_y"))
xdfsel <- select(xdf, alias(xdf$k1, "k1_x"), alias(xdf$k2, "k2_x"), alias(xdf$data, "data_x"))
res3 <- join(xdfsel, ydfsel, xdfsel$k1_x == ydfsel$k1_y)
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943810#comment-14943810 ]

Deborah Siegel commented on SPARK-9318:
---------------------------------------

Narine, just want to offer that I haven't replicated that problem.

x <- data.frame(k1 = c(NA, NA, 3, 4, 5), k2 = c(1, NA, NA, 4, 5), data = 1:5)
y <- data.frame(k1 = c(NA, 2, NA, 4, 5), k2 = c(NA, NA, 3, 4, 5), data = 1:5)
xdf <- createDataFrame(sqlContext, x)
ydf <- createDataFrame(sqlContext, y)
res <- join(xdf, ydf)
head(res)

  k1 k2 data k1 k2 data
1 NA  1    1 NA NA    1
2 NA  1    1  2 NA    2
3 NA  1    1 NA  3    3
4 NA  1    1  4  4    4
5 NA  1    1  5  5    5
6 NA NA    2 NA NA    1

> printSchema(res)
root
 |-- k1: double (nullable = true)
 |-- k2: double (nullable = true)
 |-- data: integer (nullable = true)
 |-- k1: double (nullable = true)
 |-- k2: double (nullable = true)
 |-- data: integer (nullable = true)
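[Editor's note] For comparison, base R's merge on the same data frames (plain R, no Spark needed) joins on the shared key columns rather than producing the cross product shown above; this is the behavior the proposed `merge` synonym is meant to evoke:

```r
x <- data.frame(k1 = c(NA, NA, 3, 4, 5), k2 = c(1, NA, NA, 4, 5), data = 1:5)
y <- data.frame(k1 = c(NA, 2, NA, 4, 5), k2 = c(NA, NA, 3, 4, 5), data = 1:5)

# merge joins on the shared column names (k1, k2) by default and
# disambiguates the remaining columns with suffixes; note that,
# unlike a SQL join, NA keys match NA keys.
merge(x, y, by = c("k1", "k2"), suffixes = c(".x", ".y"))
```

A SparkR join with no condition, by contrast, is a Cartesian product, which is why head(res) above pairs the first row of xdf with every row of ydf.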
[jira] [Commented] (SPARK-6823) Add a model.matrix like capability to DataFrames (modelDataFrame)
[ https://issues.apache.org/jira/browse/SPARK-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943410#comment-14943410 ]

Deborah Siegel commented on SPARK-6823:
---------------------------------------

does SPARK-9681 address RFormula supporting identity? e.g. (~ .) for a model.matrix of features? would be useful.


> Add a model.matrix like capability to DataFrames (modelDataFrame)
> -----------------------------------------------------------------
>
>                 Key: SPARK-6823
>                 URL: https://issues.apache.org/jira/browse/SPARK-6823
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, SparkR
>            Reporter: Shivaram Venkataraman
>
> Currently MLlib modeling tools work only with double data. However, data tables in practice often have a set of categorical fields (factors in R) that need to be converted to a set of 0/1 indicator variables (making the data actually used in a modeling algorithm completely numeric). In R, this is handled in modeling functions using the model.matrix function. Similar functionality needs to be available within Spark.
[jira] [Issue Comment Deleted] (SPARK-8724) Need documentation on how to deploy or use SparkR in Spark 1.4.0+
[ https://issues.apache.org/jira/browse/SPARK-8724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deborah Siegel updated SPARK-8724:
----------------------------------
    Comment: was deleted

(was: I would work on this documentation if that's fine.)


> Need documentation on how to deploy or use SparkR in Spark 1.4.0+
> -----------------------------------------------------------------
>
>                 Key: SPARK-8724
>                 URL: https://issues.apache.org/jira/browse/SPARK-8724
>             Project: Spark
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.4.0
>            Reporter: Felix Cheung
>            Priority: Minor
>
> As of now there doesn't seem to be any official documentation on how to deploy SparkR with Spark 1.4.0+.
> Also, cluster manager specific documentation (like http://spark.apache.org/docs/latest/spark-standalone.html) does not call out what mode is supported for SparkR or give details on deployment steps.
[jira] [Commented] (SPARK-8724) Need documentation on how to deploy or use SparkR in Spark 1.4.0+
[ https://issues.apache.org/jira/browse/SPARK-8724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741007#comment-14741007 ]

Deborah Siegel commented on SPARK-8724:
---------------------------------------

I would work on this documentation if that's fine.
[jira] [Commented] (SPARK-9713) Document SparkR MLlib glm() integration in Spark 1.5
[ https://issues.apache.org/jira/browse/SPARK-9713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729607#comment-14729607 ]

Deborah Siegel commented on SPARK-9713:
---------------------------------------

It seems that summary() on a binomial classification model from glm is not available the way it is for a gaussian model? Does SparkRWrappers.scala need to be updated, or the documentation?

> dbwt <- createDataFrame(sqlContext, birthwt)
> logisticmodel <- glm(smoke ~ bwt, data = dbwt, family = "binomial")
> summary(logisticmodel)
15/09/03 12:02:43 ERROR RBackendHandler: getModelFeatures on org.apache.spark.ml.api.r.SparkRWrappers failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.UnsupportedOperationException: No features names available for LogisticRegressionModel


> Document SparkR MLlib glm() integration in Spark 1.5
> ----------------------------------------------------
>
>                 Key: SPARK-9713
>                 URL: https://issues.apache.org/jira/browse/SPARK-9713
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation, ML, SparkR
>    Affects Versions: 1.5.0
>            Reporter: Eric Liang
>            Assignee: Eric Liang
>            Priority: Critical
>             Fix For: 1.5.0
>
>
> The new SparkR functions in mllib.R should be documented: glm(), predict(), and summary().
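[Editor's note] For comparison, the base R behavior the comment expects: summary() on a binomial glm works just as it does for a gaussian one (plain R, no Spark needed; the birthwt data set ships with the MASS package):

```r
library(MASS)  # provides the birthwt data set

# Base R summarizes a binomial-family fit without complaint,
# which is what makes the SparkR error above surprising.
m <- glm(smoke ~ bwt, data = birthwt, family = "binomial")
coef(summary(m))
```

The SparkR error suggests the 1.5 wrapper only exposed feature names for linear regression models, so the gap is in SparkRWrappers rather than in the user's call.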
[jira] [Commented] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)
[ https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717473#comment-14717473 ]

Deborah Siegel commented on SPARK-9316:
---------------------------------------

Now that %in% is exported in the namespace, both the filter and the '[' syntax work with it. Thanks [~shivaram].

[~felixcheung] It's not apparent to me at the moment why one would need support for quoted syntax in the brackets, with %in% working. btw, although filter works with ("age in (19, 30)"), the bracket notation with the quotes is still getting an error both ways:

> subsetdf <- df["age in (19, 30),1:2"]
Error in df["age in (19, 30),1:2"] : object of type 'S4' is not subsettable
> subsetdf <- df["age in (19, 30)",1:2]
Error in df["age in (19, 30)", 1:2] : object of type 'S4' is not subsettable


> Add support for filtering using `[` (synonym for filter / select)
> -----------------------------------------------------------------
>
>                 Key: SPARK-9316
>                 URL: https://issues.apache.org/jira/browse/SPARK-9316
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>            Reporter: Shivaram Venkataraman
>            Assignee: Felix Cheung
>             Fix For: 1.6.0, 1.5.1
>
>
> Will help us support queries of the form
> {code}
> air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)]
> {code}
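[Editor's note] To summarize the forms that do work once %in% is exported, a short sketch (assuming a SparkR 1.5.1+ session in which df was built with createDataFrame; untested here, since it needs a live Spark backend):

```r
# Column-expression forms: both keep rows where age is 19 or 30.
a <- df[df$age %in% c(19, 30), 1:2]   # bracket syntax, columns 1-2
b <- filter(df, "age in (19, 30)")    # SQL-string predicate

# A quoted SQL string is only accepted by filter(); passing one
# inside `[` dispatches the character method and fails, as the
# "object of type 'S4' is not subsettable" errors show.
```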
[jira] [Comment Edited] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)
[ https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715472#comment-14715472 ]

Deborah Siegel edited comment on SPARK-9316 at 8/26/15 9:03 PM:
---------------------------------------------------------------

Hi. I just built spark from the latest github today.

This worked as expected:
> subsetdf <- df[df$name == "Andy", c(1,2)]

However, with all due respect, I got some errors.
> subsetdf <- df[df$age %in% c(19, 30), 1:2]
Error in df[df$age %in% c(19, 30), 1:2] :
  error in evaluating the argument 'i' in selecting a method for function '[':
  Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments

the following filter worked previously for me:
> subsetdf <- filter(df, "age in (19, 30)")

so I tried this syntax, which gave a different error:
> subsetdf <- df["age in (19, 30),1:2"]
Error in df["age in (19, 30),1:2"] : object of type 'S4' is not subsettable

was (Author: dsiegel):
Hi. I just built spark from the latest github today.

This worked as expected:
> subsetdf <- df[df$name == "Andy", c(1,2)]

However, not sure about this:
> subsetdf <- df[df$age %in% c(19, 30), 1:2]
Error in df[df$age %in% c(19, 30), 1:2] :
  error in evaluating the argument 'i' in selecting a method for function '[':
  Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments

the following filter works for me:
> subsetdf <- filter(df, "age in (19, 30)")

so I tried this syntax, which didn't work:
> subsetdf <- df["age in (19, 30),1:2"]
Error in df["age in (19, 30),1:2"] : object of type 'S4' is not subsettable
[jira] [Commented] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)
[ https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715472#comment-14715472 ]

Deborah Siegel commented on SPARK-9316:
---------------------------------------

Hi. I just built spark from the latest github today.

This worked as expected:
> subsetdf <- df[df$name == "Andy", c(1,2)]

However, not sure about this:
> subsetdf <- df[df$age %in% c(19, 30), 1:2]
Error in df[df$age %in% c(19, 30), 1:2] :
  error in evaluating the argument 'i' in selecting a method for function '[':
  Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments

the following filter works for me:
> subsetdf <- filter(df, "age in (19, 30)")

so I tried this syntax, which didn't work:
> subsetdf <- df["age in (19, 30),1:2"]
Error in df["age in (19, 30),1:2"] : object of type 'S4' is not subsettable
[jira] [Created] (SPARK-7136) Spark SQL and DataFrame Guide - missing file paths and non-existent example file
Deborah Siegel created SPARK-7136:
----------------------------------

             Summary: Spark SQL and DataFrame Guide - missing file paths and non-existent example file
                 Key: SPARK-7136
                 URL: https://issues.apache.org/jira/browse/SPARK-7136
             Project: Spark
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 1.3.1
            Reporter: Deborah Siegel
            Priority: Minor

The example code in the "Generic Load/Save Functions" section needs a path to the people.parquet file to load. Additionally, there is no file "people.parquet" in examples/src/main/resources/. A file "people.parquet" is in fact only saved off in a later section (Parquet Files - Loading Data Programmatically).

The best fix is probably one in which all the example code can run independently of whether the other example code has been run. The proposal is to instead use a file which does exist for "Generic Load/Save Functions": examples/src/main/resources/users.parquet (of course changing the fields which are selected as well).
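[Editor's note] The proposed change can be sketched in SparkR against the file that does ship with Spark (users.parquet carries name, favorite_color, and favorite_numbers fields). This assumes a SparkR 1.4+ session, paths relative to the Spark home, and the default parquet data source; untested here:

```r
# Hypothetical SparkR version of the fix: load a file that actually
# exists in the repo instead of the nonexistent people.parquet.
df <- read.df(sqlContext, "examples/src/main/resources/users.parquet")
head(select(df, "name", "favorite_color"))
```

The guide itself shows Scala, Java, and Python variants of this load; the same path substitution applies in each.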
[jira] [Created] (SPARK-7102) update apache hosted graphx-programming-guide doc
Deborah Siegel created SPARK-7102:
----------------------------------

             Summary: update apache hosted graphx-programming-guide doc
                 Key: SPARK-7102
                 URL: https://issues.apache.org/jira/browse/SPARK-7102
             Project: Spark
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 1.3.1
            Reporter: Deborah Siegel
            Priority: Trivial

After reading the "Contributing to Spark" guide, I've realized that a previously merged pull request improving the Spark documentation for the graphx-programming-guide has not made it to the Apache-hosted site outside of the docs. Is this because I did not make a JIRA for it?

this is in regards to pull request: aggregateMessages example in graphX doc #4853