GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/19243

    [SPARK-21780][R] Simpler Dataset.sample API in R

    ## What changes were proposed in this pull request?
    
    This PR makes it possible to omit `withReplacement` in `sample(...)`, defaulting it to `FALSE`, consistent with the equivalent Scala / Java / Python APIs.
    
    In short, the following examples are allowed:
    
    ```r
    > df <- createDataFrame(as.list(seq(10)))
    > count(sample(df, 0.5, 3))
    [1] 4
    > count(sample(df, fraction=0.5, seed=3))
    [1] 4
    > count(sample(df, withReplacement=TRUE, fraction=0.5, seed=3))
    [1] 2
    > count(sample(df, 1.0))
    [1] 10
    > count(sample(df, fraction=1.0))
    [1] 10
    > count(sample(df, FALSE, fraction=1.0))
    [1] 10
    > count(sample(df, 1.0, withReplacement=FALSE))
    [1] 10
    ```
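    
    As a rough illustration of how the call patterns above could be resolved, here is a minimal base-R sketch. The helper name `resolveSampleArgs` is hypothetical and this is not the code in the PR; it only assumes that a numeric value in the `withReplacement` position means the caller omitted `withReplacement`.
    
    ```r
    # Hypothetical helper (not SparkR's implementation): normalizes the
    # arguments of a sample()-like call into (withReplacement, fraction, seed).
    resolveSampleArgs <- function(withReplacement = FALSE, fraction, seed) {
      # Record what the caller supplied before any argument is reassigned.
      hasFraction <- !missing(fraction)
      hasSeed <- !missing(seed)
    
      if (is.numeric(withReplacement)) {
        # A numeric first argument is treated as the fraction, so the
        # positional arguments are shifted one slot to the right.
        if (hasFraction && hasSeed) {
          stop("too many numeric arguments")
        }
        if (hasFraction) {
          seed <- fraction
          hasSeed <- TRUE
        }
        fraction <- withReplacement
        hasFraction <- TRUE
        withReplacement <- FALSE
      }
    
      if (!hasFraction) {
        stop("fraction is required")
      }
    
      list(withReplacement = withReplacement,
           fraction = fraction,
           seed = if (hasSeed) seed else NULL)
    }
    
    # Mirrors sample(df, 0.5, 3) and sample(df, 1.0, withReplacement=FALSE).
    str(resolveSampleArgs(0.5, 3))
    str(resolveSampleArgs(1.0, withReplacement = FALSE))
    ```
    
    Named arguments such as `fraction=0.5` need no special handling in this sketch, because R matches named arguments before positional ones.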
    
    In addition, this PR adds type checking, as shown below:
    
    ```r
    > sample(df)
    Error in sample(df) :
      x (required), withReplacement (optional), fraction (required) and seed (optional) should be SparkDataFrame, logical, numeric and numeric; however, got [SparkDataFrame]
    > sample(df, "a")
    Error in sample(df, "a") :
      x (required), withReplacement (optional), fraction (required) and seed (optional) should be SparkDataFrame, logical, numeric and numeric; however, got [SparkDataFrame, character]
    > sample(df, TRUE, seed="abc")
    Error in sample(df, TRUE, seed = "abc") :
      x (required), withReplacement (optional), fraction (required) and seed (optional) should be SparkDataFrame, logical, numeric and numeric; however, got [SparkDataFrame, logical, character]
    > sample(df, -1.0)
    ...
    Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
    ```
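    
    A check of this kind could look roughly like the base-R sketch below. The function `validateSampleArgs` and the stand-in `df` object are hypothetical and only illustrate how the classes of the supplied arguments can be collected into the message; the sketch does not check `x` itself or the range of `fraction` (the `-1.0` error above is a separate requirement failure).
    
    ```r
    # Hypothetical validator (not SparkR's implementation): reports the
    # classes of the arguments the caller actually supplied.
    validateSampleArgs <- function(x, withReplacement = FALSE, fraction, seed) {
      hasFraction <- !missing(fraction)
      hasSeed <- !missing(seed)
    
      # Collect the class of each supplied argument, in call order.
      got <- class(x)[1]
      if (!missing(withReplacement)) got <- c(got, class(withReplacement)[1])
      if (hasFraction) got <- c(got, class(fraction)[1])
      if (hasSeed) got <- c(got, class(seed)[1])
    
      # Accept (logical, numeric[, numeric]) or the shortened form in which
      # a numeric fraction takes the withReplacement position.
      canonical <- is.logical(withReplacement) && hasFraction && is.numeric(fraction)
      shifted <- is.numeric(withReplacement) && (!hasFraction || is.numeric(fraction))
      seedOk <- !hasSeed || is.numeric(seed)
    
      if (!((canonical || shifted) && seedOk)) {
        stop("x (required), withReplacement (optional), fraction (required) and seed ",
             "(optional) should be SparkDataFrame, logical, numeric and numeric; ",
             "however, got [", paste(got, collapse = ", "), "]", call. = FALSE)
      }
      invisible(TRUE)
    }
    
    # Stand-in object so the sketch runs without a Spark session.
    df <- structure(list(), class = "SparkDataFrame")
    validateSampleArgs(df, 0.5, 3)
    ```
    
    Calling `validateSampleArgs(df, "a")` in this sketch stops with a message ending in `[SparkDataFrame, character]`, in the same form as the errors shown above.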
    
    ## How was this patch tested?
    
    Manually tested; unit tests were added in `R/pkg/tests/fulltests/test_sparkSQL.R`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-21780

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19243.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19243
    
----
commit 680157ef95e5ef4a898e339749d6a8bb2d464991
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-09-15T07:10:09Z

    Simpler Dataset.sample API in R

----

