GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/19243
[SPARK-21780][R] Simpler Dataset.sample API in R

## What changes were proposed in this pull request?

This PR makes `sample(...)` able to omit `withReplacement`, which then defaults to `FALSE`, consistent with the equivalent Scala / Java / Python APIs.

In short, the following examples are allowed:

```r
> df <- createDataFrame(as.list(seq(10)))
> count(sample(df, 0.5, 3))
[1] 4
> count(sample(df, fraction=0.5, seed=3))
[1] 4
> count(sample(df, withReplacement=TRUE, fraction=0.5, seed=3))
[1] 2
> count(sample(df, 1.0))
[1] 10
> count(sample(df, fraction=1.0))
[1] 10
> count(sample(df, FALSE, fraction=1.0))
[1] 10
> count(sample(df, 1.0, withReplacement=FALSE))
[1] 10
```

In addition, this PR adds type checking on the arguments, as shown below:

```r
> sample(df)
Error in sample(df) :
  x (required), withReplacement (optional), fraction (required) and seed (optional) should be SparkDataFrame, logical, numeric and numeric; however, got [SparkDataFrame]
> sample(df, "a")
Error in sample(df, "a") :
  x (required), withReplacement (optional), fraction (required) and seed (optional) should be SparkDataFrame, logical, numeric and numeric; however, got [SparkDataFrame, character]
> sample(df, TRUE, seed="abc")
Error in sample(df, TRUE, seed = "abc") :
  x (required), withReplacement (optional), fraction (required) and seed (optional) should be SparkDataFrame, logical, numeric and numeric; however, got [SparkDataFrame, logical, character]
> sample(df, -1.0)
...
Error in sample : illegal argument - requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
```

## How was this patch tested?

Manually tested; unit tests were added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
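For reviewers who want to reason about the argument handling without a Spark session, the behaviour described above (a leading numeric argument is read as `fraction`, and a following one as `seed`) could be sketched in plain R roughly as follows. This is an illustration only, not the code in this PR: `resolve_sample_args` is a hypothetical helper, the `NULL` defaults are an assumption made to keep the sketch simple, and the error text is shortened.

```r
# Hypothetical helper, NOT the implementation from this PR.
# It resolves the arguments that follow the data frame in sample(x, ...).
resolve_sample_args <- function(withReplacement = FALSE, fraction = NULL, seed = NULL) {
  # A numeric value in the first slot means withReplacement was omitted
  # positionally, so shift the arguments one slot to the right.
  if (is.numeric(withReplacement) && is.null(seed)) {
    seed <- fraction
    fraction <- withReplacement
    withReplacement <- FALSE
  }

  # Type checks only; the range check on fraction is assumed to happen
  # elsewhere, as the sample(df, -1.0) error in the description suggests.
  if (!is.logical(withReplacement) || !is.numeric(fraction) ||
      !(is.null(seed) || is.numeric(seed))) {
    stop("withReplacement (optional), fraction (required) and seed (optional) ",
         "should be logical, numeric and numeric")
  }

  list(withReplacement = withReplacement, fraction = fraction, seed = seed)
}

# Mirrors sample(df, 0.5, 3): fraction = 0.5, seed = 3
str(resolve_sample_args(0.5, 3))

# Mirrors sample(df, 1.0, withReplacement=FALSE): fraction = 1, no seed
str(resolve_sample_args(1.0, withReplacement = FALSE))
```

Under these assumptions, calls that mirror the failing examples, such as `resolve_sample_args()` or `resolve_sample_args("a")`, stop with a type error before any sampling would be attempted.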
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-21780

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19243.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19243

----
commit 680157ef95e5ef4a898e339749d6a8bb2d464991
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-09-15T07:10:09Z

    Simpler Dataset.sample API in R

----