spark git commit: [SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames

srowen Thu, 25 Feb 2016 05:27:50 -0800

Repository: spark
Updated Branches:
  refs/heads/branch-1.6 3cc938ac8 -> cb869a143



[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames

Change line 113 of QuantileDiscretizer.scala to

`val requiredSamples = math.max(numBins * numBins, 10000.0)`

so that `requiredSamples` is a `Double`. This fixes the division on line 114,
which is currently an integer division and evaluates to zero whenever
`requiredSamples < dataset.count`.
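
As a minimal sketch of the truncation described above, in plain Scala without
Spark: `FractionDemo`, `brokenFraction`, and `fixedFraction` are hypothetical
names used only for illustration, and `totalSamples` stands in for
`dataset.count()`, which returns a `Long`.

```scala
object FractionDemo {
  // Mirrors the pre-fix expression: requiredSamples is an Int, totalSamples is
  // a Long, so `/` is integer division and truncates toward zero.
  def brokenFraction(numBins: Int, totalSamples: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000)
    math.min(requiredSamples / totalSamples, 1.0)
  }

  // Converts to Double before dividing, so the quotient keeps its fractional part.
  def fixedFraction(numBins: Int, totalSamples: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000)
    math.min(requiredSamples.toDouble / totalSamples, 1.0)
  }

  def main(args: Array[String]): Unit = {
    // A dataset one row larger than the 10000-sample minimum.
    println(brokenFraction(5, 10001L)) // sampling fraction collapses to zero
    println(fixedFraction(5, 10001L))  // sampling fraction stays just below 1.0
  }
}
```

With a sampling fraction of zero, no rows would be drawn for estimating the
quantiles, which is consistent with the bad splits reported on large inputs.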

Manual tests. I was having problems using QuantileDiscretizer with my
dataset, and after making this change QuantileDiscretizer behaves as expected.

Author: Oliver Pierson <o...@gatech.edu>
Author: Oliver Pierson <opier...@umd.edu>

Closes #11319 from oliverpierson/SPARK-13444.

(cherry picked from commit 6f8e835c68dff6fcf97326dc617132a41ff9d043)
Signed-off-by: Sean Owen <so...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb869a14
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb869a14
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb869a14

Branch: refs/heads/branch-1.6
Commit: cb869a143d338985c3d99ef388dd78b1e3d90a73
Parents: 3cc938a
Author: Oliver Pierson <o...@gatech.edu>
Authored: Thu Feb 25 13:24:46 2016 +0000
Committer: Sean Owen <so...@cloudera.com>
Committed: Thu Feb 25 13:27:10 2016 +0000

----------------------------------------------------------------------
 .../spark/ml/feature/QuantileDiscretizer.scala  | 11 +++++++++--
 .../ml/feature/QuantileDiscretizerSuite.scala   | 20 ++++++++++++++++++++
 2 files changed, 29 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/cb869a14/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
index 7bf67c6..cd5085a 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
@@ -97,6 +97,13 @@ final class QuantileDiscretizer(override val uid: String)
 
 @Since("1.6.0")
 object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] with Logging {
+
+  /**
+   * Minimum number of samples required for finding splits, regardless of number of bins.  If
+   * the dataset has fewer rows than this value, the entire dataset will be used.
+   */
+  private[spark] val minSamplesRequired: Int = 10000
+
   /**
    * Sampling from the given dataset to collect quantile statistics.
    */
@@ -104,8 +111,8 @@ object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] wi
     val totalSamples = dataset.count()
     require(totalSamples > 0,
       "QuantileDiscretizer requires non-empty input dataset but was given an empty input.")
-    val requiredSamples = math.max(numBins * numBins, 10000)
-    val fraction = math.min(requiredSamples / dataset.count(), 1.0)
+    val requiredSamples = math.max(numBins * numBins, minSamplesRequired)
+    val fraction = math.min(requiredSamples.toDouble / dataset.count(), 1.0)
     dataset.sample(withReplacement = false, fraction, new XORShiftRandom().nextInt()).collect()
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/cb869a14/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
----------------------------------------------------------------------
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
index 3a4f6d2..32bfa43 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
@@ -71,6 +71,26 @@ class QuantileDiscretizerSuite
     }
   }
 
+  test("Test splits on dataset larger than minSamplesRequired") {
+    val sqlCtx = SQLContext.getOrCreate(sc)
+    import sqlCtx.implicits._
+
+    val datasetSize = QuantileDiscretizer.minSamplesRequired + 1
+    val numBuckets = 5
+    val df = sc.parallelize((1.0 to datasetSize by 1.0).map(Tuple1.apply)).toDF("input")
+    val discretizer = new QuantileDiscretizer()
+      .setInputCol("input")
+      .setOutputCol("result")
+      .setNumBuckets(numBuckets)
+      .setSeed(1)
+
+    val result = discretizer.fit(df).transform(df)
+    val observedNumBuckets = result.select("result").distinct.count
+
+    assert(observedNumBuckets === numBuckets,
+      "Observed number of buckets does not equal expected number of buckets.")
+  }
+
   test("read/write") {
     val t = new QuantileDiscretizer()
       .setInputCol("myInputCol")


