[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19243

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19243#discussion_r139884675

Diff: R/pkg/R/DataFrame.R

```diff
@@ -998,33 +998,44 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction = 0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed = 3))
 #'}
 #' @note sample since 1.4.0
 setMethod("sample",
-          signature(x = "SparkDataFrame", withReplacement = "logical",
-                    fraction = "numeric"),
-          function(x, withReplacement, fraction, seed) {
-            if (fraction < 0.0) stop(cat("Negative fraction value:", fraction))
+          signature(x = "SparkDataFrame"),
+          function(x, withReplacement = FALSE, fraction, seed) {
+            if (!is.numeric(fraction)) {
+              stop(paste("fraction must be numeric; however, got", class(fraction)))
+            }
+            if (!is.logical(withReplacement)) {
+              stop(paste("withReplacement must be logical; however, got", class(withReplacement)))
+            }
             if (!missing(seed)) {
+              if (is.null(seed) || is.na(seed)) {
+                stop(paste("seed must not be NULL or NA; however, got", class(seed)))
```

--- End diff --

I see ..
[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19243#discussion_r139868488

Diff: R/pkg/R/DataFrame.R

```diff
@@ -998,33 +998,44 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction = 0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed = 3))
 #'}
 #' @note sample since 1.4.0
 setMethod("sample",
-          signature(x = "SparkDataFrame", withReplacement = "logical",
-                    fraction = "numeric"),
-          function(x, withReplacement, fraction, seed) {
-            if (fraction < 0.0) stop(cat("Negative fraction value:", fraction))
+          signature(x = "SparkDataFrame"),
+          function(x, withReplacement = FALSE, fraction, seed) {
+            if (!is.numeric(fraction)) {
+              stop(paste("fraction must be numeric; however, got", class(fraction)))
+            }
+            if (!is.logical(withReplacement)) {
+              stop(paste("withReplacement must be logical; however, got", class(withReplacement)))
+            }
             if (!missing(seed)) {
+              if (is.null(seed) || is.na(seed)) {
+                stop(paste("seed must not be NULL or NA; however, got", class(seed)))
```

--- End diff --

this actually doesn't work for NA

```
> class(NULL)
[1] "NULL"
> class(NA)
[1] "logical"
```
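To illustrate the point above: reporting `class(seed)` in the error is fine for NULL but misleading for NA, since `class(NA)` is `"logical"`. A minimal sketch of a stricter check (the helper name is hypothetical, not SparkR's actual code); note the check order matters because `is.na(NULL)` returns a length-zero logical that cannot drive an `if`:

```r
# Hypothetical seed validator. Rejects NULL and NA up front, and
# reports the offending value itself rather than its class, so that
# NA does not produce a message claiming it "got logical".
validateSeed <- function(seed) {
  if (is.null(seed) || is.na(seed)) {
    stop(paste("seed must not be NULL or NA; however, got", deparse(seed)))
  }
  as.integer(seed)
}

# validateSeed(3)    returns 3L
# validateSeed(NA)   stops with "... got NA"
# validateSeed(NULL) stops with "... got NULL"
```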
[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19243#discussion_r139665152

Diff: R/pkg/R/DataFrame.R

```diff
@@ -998,33 +998,39 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction=0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed=3))
```

--- End diff --

D'oh.
[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19243#discussion_r139602239

Diff: R/pkg/R/DataFrame.R

```diff
@@ -998,33 +998,39 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction=0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed=3))
 #'}
 #' @note sample since 1.4.0
 setMethod("sample",
-          signature(x = "SparkDataFrame", withReplacement = "logical",
-                    fraction = "numeric"),
-          function(x, withReplacement, fraction, seed) {
-            if (fraction < 0.0) stop(cat("Negative fraction value:", fraction))
+          signature(x = "SparkDataFrame"),
+          function(x, withReplacement = FALSE, fraction, seed) {
+            if (!is.numeric(fraction)) {
+              stop(paste("fraction must be numeric; however, got", class(fraction)))
+            }
+            if (!is.logical(withReplacement)) {
+              stop(paste("withReplacement must be logical; however, got", class(withReplacement)))
+            }
             if (!missing(seed)) {
               # TODO : Figure out how to send integer as java.lang.Long to JVM so
               # we can send seed as an argument through callJMethod
-              sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction, as.integer(seed))
+              sdf <- handledCallJMethod(x@sdf, "sample", withReplacement,
```

--- End diff --

I'd then wrap `withReplacement` as `as.logical(withReplacement)` and `fraction` as `as.numeric(fraction)` because it might be coercible (note the L)

```
> is.numeric(1L)
[1] TRUE
```

but passing as integer could cause callJMethod to match to a different signature on the JVM.
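A minimal sketch of the coercion suggested above (standalone R, not SparkR's actual implementation): integer literals pass `is.numeric()` but are stored as R integers, so without an explicit `as.numeric()` an argument like `1L` could be serialized as a JVM `int` and match a different overload than the intended `double` one.

```r
# is.numeric() is TRUE for both doubles and integers, so the type
# check alone does not guarantee a double reaches the JVM.
stopifnot(is.numeric(1L))
stopifnot(is.integer(1L))

# Explicit coercion normalizes both arguments before the JVM call:
withReplacement <- as.logical("TRUE")  # coercible input becomes TRUE
fraction <- as.numeric(1L)             # integer 1L becomes double 1.0

stopifnot(identical(withReplacement, TRUE))
stopifnot(is.double(fraction))
```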
[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19243#discussion_r139602407

Diff: R/pkg/R/DataFrame.R

```diff
@@ -998,33 +998,39 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction=0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed=3))
 #'}
 #' @note sample since 1.4.0
 setMethod("sample",
-          signature(x = "SparkDataFrame", withReplacement = "logical",
-                    fraction = "numeric"),
-          function(x, withReplacement, fraction, seed) {
-            if (fraction < 0.0) stop(cat("Negative fraction value:", fraction))
+          signature(x = "SparkDataFrame"),
+          function(x, withReplacement = FALSE, fraction, seed) {
+            if (!is.numeric(fraction)) {
+              stop(paste("fraction must be numeric; however, got", class(fraction)))
+            }
+            if (!is.logical(withReplacement)) {
+              stop(paste("withReplacement must be logical; however, got", class(withReplacement)))
+            }
             if (!missing(seed)) {
               # TODO : Figure out how to send integer as java.lang.Long to JVM so
               # we can send seed as an argument through callJMethod
-              sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction, as.integer(seed))
+              sdf <- handledCallJMethod(x@sdf, "sample", withReplacement,
+                                        fraction, as.integer(seed))
```

--- End diff --

we should be careful only `as.integer` if it isn't NULL or NA

```
> as.integer(NULL)
integer(0)
> as.integer(NA)
[1] NA
```
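The two degenerate results above can be demonstrated and guarded against in a short standalone sketch (not SparkR's actual code): `as.integer()` never errors on NULL or NA, it just silently produces a zero-length vector or `NA_integer_`, so the check has to come before the conversion.

```r
# as.integer() silently degrades invalid seeds instead of failing:
stopifnot(length(as.integer(NULL)) == 0)  # integer(0), not an error
stopifnot(is.na(as.integer(NA)))          # NA_integer_, not an error

# Guarded conversion: validate first, convert second.
seed <- 3
if (!is.null(seed) && !is.na(seed)) {
  seed <- as.integer(seed)
}
stopifnot(identical(seed, 3L))
```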
[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19243#discussion_r139601709

Diff: R/pkg/R/DataFrame.R

```diff
@@ -984,12 +984,12 @@ setMethod("unique",
 #' of the total count of of the given SparkDataFrame.
 #'
 #' @param x A SparkDataFrame
-#' @param withReplacement Sampling with replacement or not
+#' @param withReplacement Sampling with replacement or not. Default is \code{FALSE}.
```

--- End diff --

we actually want to avoid documenting the default value when it is already in the signature, because roxygen2 generates that part of the doc from the signature
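To illustrate the convention described above (a sketch, not the final SparkR doc text): roxygen2 derives the Usage section from the function signature, so stating the default again in `@param` would duplicate generated documentation.

```r
# The @param text can stay free of the default value:
#' @param withReplacement Sampling with replacement or not.

# roxygen2 already renders the default in the generated Usage section,
# roughly:  sample(x, withReplacement = FALSE, fraction, seed)
# so "Default is FALSE" in @param would repeat what Usage shows.
```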
[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19243#discussion_r139601790

Diff: R/pkg/R/DataFrame.R

```diff
@@ -998,33 +998,39 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction=0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed=3))
```

--- End diff --

I think the style here is to have one space between the param name and its value, like `seed = 3` and `fraction = 0.5` above
[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/19243

[SPARK-21780][R] Simpler Dataset.sample API in R

## What changes were proposed in this pull request?

This PR makes `sample(...)` able to omit `withReplacement`, defaulting to `FALSE`, consistent with the equivalent Scala / Java / Python APIs. In short, the following examples are allowed:

```r
> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, 0.5, 3))
[1] 4
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, withReplacement=TRUE, fraction=0.5, seed=3))
[1] 2
> count(sample(df, 1.0))
[1] 10
> count(sample(df, fraction=1.0))
[1] 10
> count(sample(df, FALSE, fraction=1.0))
[1] 10
> count(sample(df, 1.0, withReplacement=FALSE))
[1] 10
```

In addition, this PR also adds some type-checking logic as below:

```r
> sample(df)
Error in sample(df) : x (required), withReplacement (optional), fraction (required) and seed (optional) should be SparkDataFrame, logical, numeric and numeric; however, got [SparkDataFrame]
> sample(df, "a")
Error in sample(df, "a") : x (required), withReplacement (optional), fraction (required) and seed (optional) should be SparkDataFrame, logical, numeric and numeric; however, got [SparkDataFrame, character]
> sample(df, TRUE, seed="abc")
Error in sample(df, TRUE, seed = "abc") : x (required), withReplacement (optional), fraction (required) and seed (optional) should be SparkDataFrame, logical, numeric and numeric; however, got [SparkDataFrame, logical, character]
> sample(df, -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
```

## How was this patch tested?

Manually tested; unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-21780

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19243.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19243

commit 680157ef95e5ef4a898e339749d6a8bb2d464991
Author: hyukjinkwon
Date: 2017-09-15T07:10:09Z

    Simpler Dataset.sample API in R