Repository: spark
Updated Branches:
  refs/heads/master d642b2735 -> 6075f5b4d


[SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark 
QuantileDiscretizer

This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to 
match Scala.

Also cleaned up a duplication of `numBuckets` where the param is both a class 
and instance attribute (I removed the instance attr to match the style of 
params throughout `ml`).

Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now 
uses `approxQuantile`.

## How was this patch tested?

Added a small doctest, and built the API docs locally to check HTML doc generation.

Author: Nick Pentreath <ni...@za.ibm.com>

Closes #13228 from MLnick/SPARK-15442-py-relerror-param.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6075f5b4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6075f5b4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6075f5b4

Branch: refs/heads/master
Commit: 6075f5b4d8e98483d26c31576f58e2229024b4f4
Parents: d642b27
Author: Nick Pentreath <ni...@za.ibm.com>
Authored: Tue May 24 10:02:10 2016 +0200
Committer: Nick Pentreath <ni...@za.ibm.com>
Committed: Tue May 24 10:02:10 2016 +0200

----------------------------------------------------------------------
 .../spark/ml/feature/QuantileDiscretizer.scala  | 13 +++--
 python/pyspark/ml/feature.py                    | 51 ++++++++++++++------
 2 files changed, 44 insertions(+), 20 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/6075f5b4/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
index 5a6daa0..6148359 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
@@ -50,13 +50,13 @@ private[feature] trait QuantileDiscretizerBase extends 
Params
   /**
    * Relative error (see documentation for
    * [[org.apache.spark.sql.DataFrameStatFunctions.approxQuantile 
approxQuantile]] for description)
-   * Must be a number in [0, 1].
+   * Must be in the range [0, 1].
    * default: 0.001
    * @group param
    */
   val relativeError = new DoubleParam(this, "relativeError", "The relative 
target precision " +
-    "for approxQuantile",
-    ParamValidators.inRange(0.0, 1.0))
+    "for the approximate quantile algorithm used to generate buckets. " +
+    "Must be in the range [0, 1].", ParamValidators.inRange(0.0, 1.0))
   setDefault(relativeError -> 0.001)
 
   /** @group getParam */
@@ -66,8 +66,11 @@ private[feature] trait QuantileDiscretizerBase extends Params
 /**
  * :: Experimental ::
  * `QuantileDiscretizer` takes a column with continuous features and outputs a 
column with binned
- * categorical features. The bin ranges are chosen by taking a sample of the 
data and dividing it
- * into roughly equal parts. The lower and upper bin bounds will be -Infinity 
and +Infinity,
+ * categorical features. The number of bins can be set using the `numBuckets` 
parameter.
+ * The bin ranges are chosen using an approximate algorithm (see the 
documentation for
+ * [[org.apache.spark.sql.DataFrameStatFunctions.approxQuantile 
approxQuantile]]
+ * for a detailed description). The precision of the approximation can be 
controlled with the
+ * `relativeError` parameter. The lower and upper bin bounds will be 
`-Infinity` and `+Infinity`,
  * covering all real values.
  */
 @Experimental

http://git-wip-us.apache.org/repos/asf/spark/blob/6075f5b4/python/pyspark/ml/feature.py
----------------------------------------------------------------------
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 93745c7..eb555cb 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -1177,16 +1177,20 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, 
HasOutputCol, HasSeed, Jav
     .. note:: Experimental
 
     `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-    categorical features. The bin ranges are chosen by taking a sample of the 
data and dividing it
-    into roughly equal parts. The lower and upper bin bounds will be -Infinity 
and +Infinity,
-    covering all real values. This attempts to find numBuckets partitions 
based on a sample of data,
-    but it may find fewer depending on the data sample values.
+    categorical features. The number of bins can be set using the 
:py:attr:`numBuckets` parameter.
+    The bin ranges are chosen using an approximate algorithm (see the 
documentation for
+    :py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed 
description).
+    The precision of the approximation can be controlled with the
+    :py:attr:`relativeError` parameter.
+    The lower and upper bin bounds will be `-Infinity` and `+Infinity`, 
covering all real values.
 
     >>> df = spark.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], 
["values"])
     >>> qds = QuantileDiscretizer(numBuckets=2,
-    ...     inputCol="values", outputCol="buckets", seed=123)
+    ...     inputCol="values", outputCol="buckets", seed=123, 
relativeError=0.01)
     >>> qds.getSeed()
     123
+    >>> qds.getRelativeError()
+    0.01
     >>> bucketizer = qds.fit(df)
     >>> splits = bucketizer.getSplits()
     >>> splits[0]
@@ -1205,32 +1209,35 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, 
HasOutputCol, HasSeed, Jav
     .. versionadded:: 2.0.0
     """
 
-    # a placeholder to make it appear in the generated doc
     numBuckets = Param(Params._dummy(), "numBuckets",
                        "Maximum number of buckets (quantiles, or " +
-                       "categories) into which data points are grouped. Must 
be >= 2. Default 2.",
+                       "categories) into which data points are grouped. Must 
be >= 2.",
                        typeConverter=TypeConverters.toInt)
 
+    relativeError = Param(Params._dummy(), "relativeError", "The relative 
target precision for " +
+                          "the approximate quantile algorithm used to generate 
buckets. " +
+                          "Must be in the range [0, 1].",
+                          typeConverter=TypeConverters.toFloat)
+
     @keyword_only
-    def __init__(self, numBuckets=2, inputCol=None, outputCol=None, seed=None):
+    def __init__(self, numBuckets=2, inputCol=None, outputCol=None, seed=None, 
relativeError=0.001):
         """
-        __init__(self, numBuckets=2, inputCol=None, outputCol=None, seed=None)
+        __init__(self, numBuckets=2, inputCol=None, outputCol=None, seed=None, 
relativeError=0.001)
         """
         super(QuantileDiscretizer, self).__init__()
         self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.QuantileDiscretizer",
                                             self.uid)
-        self.numBuckets = Param(self, "numBuckets",
-                                "Maximum number of buckets (quantiles, or " +
-                                "categories) into which data points are 
grouped. Must be >= 2.")
-        self._setDefault(numBuckets=2)
+        self._setDefault(numBuckets=2, relativeError=0.001)
         kwargs = self.__init__._input_kwargs
         self.setParams(**kwargs)
 
     @keyword_only
     @since("2.0.0")
-    def setParams(self, numBuckets=2, inputCol=None, outputCol=None, 
seed=None):
+    def setParams(self, numBuckets=2, inputCol=None, outputCol=None, seed=None,
+                  relativeError=0.001):
         """
-        setParams(self, numBuckets=2, inputCol=None, outputCol=None, seed=None)
+        setParams(self, numBuckets=2, inputCol=None, outputCol=None, 
seed=None, \
+                  relativeError=0.001)
         Set the params for the QuantileDiscretizer
         """
         kwargs = self.setParams._input_kwargs
@@ -1250,6 +1257,20 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, 
HasOutputCol, HasSeed, Jav
         """
         return self.getOrDefault(self.numBuckets)
 
+    @since("2.0.0")
+    def setRelativeError(self, value):
+        """
+        Sets the value of :py:attr:`relativeError`.
+        """
+        return self._set(relativeError=value)
+
+    @since("2.0.0")
+    def getRelativeError(self):
+        """
+        Gets the value of relativeError or its default value.
+        """
+        return self.getOrDefault(self.relativeError)
+
     def _create_model(self, java_model):
         """
         Private method to convert the java_model to a Python model.


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

Reply via email to