Repository: spark
Updated Branches:
  refs/heads/master 310454be3 -> 07a2b8738

[SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java

## What changes were proposed in this pull request?
Dataset.sample requires a boolean flag withReplacement as the first argument. However, most of the time users simply want to sample some records without replacement. This ticket introduces a new sample function that simply takes in the fraction and seed.

## How was this patch tested?
Tested manually. Not sure yet if we should add a test case for just this wrapper ...

Author: Reynold Xin <r...@databricks.com>

Closes #18988 from rxin/SPARK-21778.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07a2b873
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07a2b873
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07a2b873

Branch: refs/heads/master
Commit: 07a2b8738ed8e6c136545d03f91a865de05e41a0
Parents: 310454b
Author: Reynold Xin <r...@databricks.com>
Authored: Fri Aug 18 23:58:20 2017 +0900
Committer: hyukjinkwon <gurwls...@gmail.com>
Committed: Fri Aug 18 23:58:20 2017 +0900

----------------------------------------------------------------------
 .../scala/org/apache/spark/sql/Dataset.scala    | 36 ++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/07a2b873/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index a9887eb..615686c 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1849,10 +1849,42 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * Returns a new [[Dataset]] by sampling a fraction of rows (without replacement),
+   * using a user-supplied seed.
+   *
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
+   * @param seed Seed for sampling.
+   *
+   * @note This is NOT guaranteed to provide exactly the fraction of the count
+   * of the given [[Dataset]].
+   *
+   * @group typedrel
+   * @since 2.3.0
+   */
+  def sample(fraction: Double, seed: Long): Dataset[T] = {
+    sample(withReplacement = false, fraction = fraction, seed = seed)
+  }
+
+  /**
+   * Returns a new [[Dataset]] by sampling a fraction of rows (without replacement).
+   *
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
+   *
+   * @note This is NOT guaranteed to provide exactly the fraction of the count
+   * of the given [[Dataset]].
+   *
+   * @group typedrel
+   * @since 2.3.0
+   */
+  def sample(fraction: Double): Dataset[T] = {
+    sample(withReplacement = false, fraction = fraction)
+  }
+
+  /**
    * Returns a new [[Dataset]] by sampling a fraction of rows, using a user-supplied seed.
    *
    * @param withReplacement Sample with replacement or not.
-   * @param fraction Fraction of rows to generate.
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
    * @param seed Seed for sampling.
    *
    * @note This is NOT guaranteed to provide exactly the fraction of the count
@@ -1871,7 +1903,7 @@ class Dataset[T] private[sql](
    * Returns a new [[Dataset]] by sampling a fraction of rows, using a random seed.
    *
    * @param withReplacement Sample with replacement or not.
-   * @param fraction Fraction of rows to generate.
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
    *
    * @note This is NOT guaranteed to provide exactly the fraction of the total count
    * of the given [[Dataset]].

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
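
For illustration only, here is a minimal sketch of how the two overloads added in this commit might be called next to the pre-existing three-argument form. It assumes Spark 2.3.0 or later on the classpath. The object name `SampleExample`, the app name, the `local[*]` master, and the `spark.range` input are hypothetical choices for the example, not part of the patch.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver object, used only to exercise the overloads.
object SampleExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("sample-example")
      .master("local[*]")
      .getOrCreate()

    // Toy input: a Dataset[java.lang.Long] holding the ids 0..999.
    val ds = spark.range(0, 1000)

    // New overloads from this commit: sampling without replacement.
    val seeded = ds.sample(0.1, 42L) // fraction plus explicit seed
    val unseeded = ds.sample(0.1)    // fraction only, random seed

    // Pre-existing form that the seeded overload delegates to.
    val explicit = ds.sample(withReplacement = false, fraction = 0.1, seed = 42L)

    // Per the @note in the Scaladoc, the sample size is only approximately
    // fraction * count, so no exact count is asserted here.
    println(s"seeded=${seeded.count()} explicit=${explicit.count()} unseeded=${unseeded.count()}")

    spark.stop()
  }
}
```

Because `sample(fraction, seed)` simply forwards to `sample(withReplacement = false, fraction, seed)`, the `seeded` and `explicit` datasets above should select the same rows when run against the same input and partitioning.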