Repository: spark
Updated Branches:
  refs/heads/master 310454be3 -> 07a2b8738


[SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java

## What changes were proposed in this pull request?
Dataset.sample requires a boolean flag, withReplacement, as its first argument.
However, most of the time users just want to sample some records without
replacement. This ticket adds two overloads, sample(fraction, seed) and
sample(fraction), that sample without replacement by default.
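
A minimal sketch of how the new overloads might be called next to the existing one. It assumes Spark 2.3.0 or later on the classpath and is written script-style (for example, pasted into a Scala REPL or spark-shell). The local master setting, the app name, and the value names `ds`, `withSeed`, `explicit`, and `randomSeed` are illustrative and not part of the patch.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for illustration.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("sample-sketch")
  .getOrCreate()

// A small Dataset of the numbers 0 until 10000.
val ds = spark.range(0, 10000)

// New overload: fraction and seed, sampling without replacement.
val withSeed = ds.sample(0.1, 42L)

// Existing overload that the new one delegates to.
val explicit = ds.sample(withReplacement = false, fraction = 0.1, seed = 42L)

// New overload: fraction only, sampling without replacement with a random seed.
val randomSeed = ds.sample(0.1)

// The sampled size is only approximately fraction * count, so print it
// rather than assume an exact value.
println(s"rows sampled with seed 42: ${withSeed.count()}")

// Call spark.stop() when finished with the session.
```

Because the two-argument overload forwards to the three-argument one with `withReplacement = false`, `withSeed` and `explicit` should contain the same rows when built from the same Dataset with the same seed.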

## How was this patch tested?
Tested manually. Not sure yet if we should add a test case for just this 
wrapper ...

Author: Reynold Xin <r...@databricks.com>

Closes #18988 from rxin/SPARK-21778.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07a2b873
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07a2b873
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07a2b873

Branch: refs/heads/master
Commit: 07a2b8738ed8e6c136545d03f91a865de05e41a0
Parents: 310454b
Author: Reynold Xin <r...@databricks.com>
Authored: Fri Aug 18 23:58:20 2017 +0900
Committer: hyukjinkwon <gurwls...@gmail.com>
Committed: Fri Aug 18 23:58:20 2017 +0900

----------------------------------------------------------------------
 .../scala/org/apache/spark/sql/Dataset.scala    | 36 ++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/07a2b873/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index a9887eb..615686c 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1849,10 +1849,42 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * Returns a new [[Dataset]] by sampling a fraction of rows (without replacement),
+   * using a user-supplied seed.
+   *
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
+   * @param seed Seed for sampling.
+   *
+   * @note This is NOT guaranteed to provide exactly the fraction of the count
+   * of the given [[Dataset]].
+   *
+   * @group typedrel
+   * @since 2.3.0
+   */
+  def sample(fraction: Double, seed: Long): Dataset[T] = {
+    sample(withReplacement = false, fraction = fraction, seed = seed)
+  }
+
+  /**
+   * Returns a new [[Dataset]] by sampling a fraction of rows (without replacement).
+   *
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
+   *
+   * @note This is NOT guaranteed to provide exactly the fraction of the count
+   * of the given [[Dataset]].
+   *
+   * @group typedrel
+   * @since 2.3.0
+   */
+  def sample(fraction: Double): Dataset[T] = {
+    sample(withReplacement = false, fraction = fraction)
+  }
+
+  /**
    * Returns a new [[Dataset]] by sampling a fraction of rows, using a user-supplied seed.
    *
    * @param withReplacement Sample with replacement or not.
-   * @param fraction Fraction of rows to generate.
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
    * @param seed Seed for sampling.
    *
    * @note This is NOT guaranteed to provide exactly the fraction of the count
@@ -1871,7 +1903,7 @@ class Dataset[T] private[sql](
    * Returns a new [[Dataset]] by sampling a fraction of rows, using a random seed.
    *
    * @param withReplacement Sample with replacement or not.
-   * @param fraction Fraction of rows to generate.
+   * @param fraction Fraction of rows to generate, range [0.0, 1.0].
    *
    * @note This is NOT guaranteed to provide exactly the fraction of the total count
    * of the given [[Dataset]].

