[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R

2017-09-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19243


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R

2017-09-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19243#discussion_r139884675
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -998,33 +998,44 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction = 0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed = 3))
 #'}
 #' @note sample since 1.4.0
 setMethod("sample",
-  signature(x = "SparkDataFrame", withReplacement = "logical",
-fraction = "numeric"),
-  function(x, withReplacement, fraction, seed) {
-if (fraction < 0.0) stop(cat("Negative fraction value:", 
fraction))
+  signature(x = "SparkDataFrame"),
+  function(x, withReplacement = FALSE, fraction, seed) {
+if (!is.numeric(fraction)) {
+  stop(paste("fraction must be numeric; however, got", 
class(fraction)))
+}
+if (!is.logical(withReplacement)) {
+  stop(paste("withReplacement must be logical; however, got", 
class(withReplacement)))
+}
+
 if (!missing(seed)) {
+  if (is.null(seed) || is.na(seed)) {
+stop(paste("seed must not be NULL or NA; however, got", 
class(seed)))
--- End diff --

I see ..


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R

2017-09-19 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19243#discussion_r139868488
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -998,33 +998,44 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction = 0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed = 3))
 #'}
 #' @note sample since 1.4.0
 setMethod("sample",
-  signature(x = "SparkDataFrame", withReplacement = "logical",
-fraction = "numeric"),
-  function(x, withReplacement, fraction, seed) {
-if (fraction < 0.0) stop(cat("Negative fraction value:", 
fraction))
+  signature(x = "SparkDataFrame"),
+  function(x, withReplacement = FALSE, fraction, seed) {
+if (!is.numeric(fraction)) {
+  stop(paste("fraction must be numeric; however, got", 
class(fraction)))
+}
+if (!is.logical(withReplacement)) {
+  stop(paste("withReplacement must be logical; however, got", 
class(withReplacement)))
+}
+
 if (!missing(seed)) {
+  if (is.null(seed) || is.na(seed)) {
+stop(paste("seed must not be NULL or NA; however, got", 
class(seed)))
--- End diff --

this actually doesn't work for NA
```
> class(NULL)
[1] "NULL"
> class(NA)
[1] "logical"
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R

2017-09-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19243#discussion_r139665152
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -998,33 +998,39 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction=0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed=3))
--- End diff --

D'oh.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R

2017-09-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19243#discussion_r139602239
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -998,33 +998,39 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction=0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed=3))
 #'}
 #' @note sample since 1.4.0
 setMethod("sample",
-  signature(x = "SparkDataFrame", withReplacement = "logical",
-fraction = "numeric"),
-  function(x, withReplacement, fraction, seed) {
-if (fraction < 0.0) stop(cat("Negative fraction value:", 
fraction))
+  signature(x = "SparkDataFrame"),
+  function(x, withReplacement = FALSE, fraction, seed) {
+if (!is.numeric(fraction)) {
+  stop(paste("fraction must be numeric; however, got", 
class(fraction)))
+}
+if (!is.logical(withReplacement)) {
+  stop(paste("withReplacement must be logical; however, got", 
class(withReplacement)))
+}
 if (!missing(seed)) {
   # TODO : Figure out how to send integer as java.lang.Long to 
JVM so
   # we can send seed as an argument through callJMethod
-  sdf <- callJMethod(x@sdf, "sample", withReplacement, 
fraction, as.integer(seed))
+  sdf <- handledCallJMethod(x@sdf, "sample", withReplacement,
--- End diff --

I'd then wrap `withReplacement` as `as.logical(withReplacement)` and 
`fraction` as `as.numeric(fraction)`
because it might be coercible (note the L)
```
> is.numeric(1L)
[1] TRUE
```
but passing as integer could cause callJMethod to match to a different 
signature on the JVM.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R

2017-09-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19243#discussion_r139602407
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -998,33 +998,39 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction=0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed=3))
 #'}
 #' @note sample since 1.4.0
 setMethod("sample",
-  signature(x = "SparkDataFrame", withReplacement = "logical",
-fraction = "numeric"),
-  function(x, withReplacement, fraction, seed) {
-if (fraction < 0.0) stop(cat("Negative fraction value:", 
fraction))
+  signature(x = "SparkDataFrame"),
+  function(x, withReplacement = FALSE, fraction, seed) {
+if (!is.numeric(fraction)) {
+  stop(paste("fraction must be numeric; however, got", 
class(fraction)))
+}
+if (!is.logical(withReplacement)) {
+  stop(paste("withReplacement must be logical; however, got", 
class(withReplacement)))
+}
 if (!missing(seed)) {
   # TODO : Figure out how to send integer as java.lang.Long to 
JVM so
   # we can send seed as an argument through callJMethod
-  sdf <- callJMethod(x@sdf, "sample", withReplacement, 
fraction, as.integer(seed))
+  sdf <- handledCallJMethod(x@sdf, "sample", withReplacement,
+fraction, as.integer(seed))
--- End diff --

we should be careful only `as.integer` if it isn't NULL or NA
```
> as.integer(NULL)
integer(0)
> as.integer(NA)
[1] NA
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R

2017-09-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19243#discussion_r139601709
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -984,12 +984,12 @@ setMethod("unique",
 #' of the total count of of the given SparkDataFrame.
 #'
 #' @param x A SparkDataFrame
-#' @param withReplacement Sampling with replacement or not
+#' @param withReplacement Sampling with replacement or not. Default is 
\code{FALSE}.
--- End diff --

we actually want to change to not documenting the default value if it is 
already in the signature - because then it would be the roxygen2 generated doc 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R

2017-09-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19243#discussion_r139601790
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -998,33 +998,39 @@ setMethod("unique",
 #' sparkR.session()
 #' path <- "path/to/file.json"
 #' df <- read.json(path)
+#' collect(sample(df, fraction=0.5))
 #' collect(sample(df, FALSE, 0.5))
-#' collect(sample(df, TRUE, 0.5))
+#' collect(sample(df, TRUE, 0.5, seed=3))
--- End diff --

I think the style here is to have one space between the param name and its 
value, like `seed = 3` and `fraction = 0.5` above


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19243: [SPARK-21780][R] Simpler Dataset.sample API in R

2017-09-15 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/19243

[SPARK-21780][R] Simpler Dataset.sample API in R

## What changes were proposed in this pull request?

This PR make `sample(...)` able to omit `withReplacement` defaulting to 
`FALSE`, consistently with equivalent Scala / Java / Python API.

In short, the following examples are allowed:

```r
> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, 0.5, 3))
[1] 4
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, withReplacement=TRUE, fraction=0.5, seed=3))
[1] 2
> count(sample(df, 1.0))
[1] 10
> count(sample(df, fraction=1.0))
[1] 10
> count(sample(df, FALSE, fraction=1.0))
[1] 10
> count(sample(df, 1.0, withReplacement=FALSE))
[1] 10
```

In addition, this PR also adds some type checking logics as below:

```r
> sample(df)
Error in sample(df) :
  x (required), withReplacement (optional), fraction (required) and seed 
(optional) should be SparkDataFrame, logical, numeric and numeric; however, got 
[SparkDataFrame]
> sample(df, "a")
Error in sample(df, "a") :
  x (required), withReplacement (optional), fraction (required) and seed 
(optional) should be SparkDataFrame, logical, numeric and numeric; however, got 
[SparkDataFrame, character]
> sample(df, TRUE, seed="abc")
Error in sample(df, TRUE, seed = "abc") :
  x (required), withReplacement (optional), fraction (required) and seed 
(optional) should be SparkDataFrame, logical, numeric and numeric; however, got 
[SparkDataFrame, logical, character]
> sample(df, -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction 
(-1.0) must be on interval [0, 1] without replacement
```

## How was this patch tested?

Manually tested, unit tests added in 
`R/pkg/tests/fulltests/test_sparkSQL.R`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-21780

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19243.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19243


commit 680157ef95e5ef4a898e339749d6a8bb2d464991
Author: hyukjinkwon 
Date:   2017-09-15T07:10:09Z

Simpler Dataset.sample API in R




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org