spark git commit: [SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in pyspark

holden Wed, 15 Feb 2017 10:12:21 -0800

Repository: spark
Updated Branches:
  refs/heads/master acf71c63c -> 6eca21ba8



[SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in pyspark

## What changes were proposed in this pull request?
This PR documents, in pyspark, the changes made to QuantileDiscretizer in PR:
https://github.com/apache/spark/pull/15428

## How was this patch tested?
No test needed

Signed-off-by: VinceShieh <vincent.xieintel.com>

Author: VinceShieh <vincent....@intel.com>

Closes #16922 from VinceShieh/spark-19590.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6eca21ba
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6eca21ba
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6eca21ba

Branch: refs/heads/master
Commit: 6eca21ba881120f1ac7854621380ef8a92972384
Parents: acf71c6
Author: VinceShieh <vincent....@intel.com>
Authored: Wed Feb 15 10:12:07 2017 -0800
Committer: Holden Karau <hol...@us.ibm.com>
Committed: Wed Feb 15 10:12:07 2017 -0800

----------------------------------------------------------------------
 python/pyspark/ml/feature.py | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/6eca21ba/python/pyspark/ml/feature.py
----------------------------------------------------------------------
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index ac90c89..1ab4291 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -1178,7 +1178,17 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
 
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
-    The bin ranges are chosen using an approximate algorithm (see the documentation for
+    It is possible that the number of buckets used will be less than this value, for example, if
+    there are too few distinct values of the input to create enough distinct quantiles.
+
+    NaN handling: Note also that
+    QuantileDiscretizer will raise an error when it finds NaN values in the dataset, but the user
+    can also choose to either keep or remove NaN values within the dataset by setting
+    :py:attr:`handleInvalid` parameter. If the user chooses to keep NaN values, they will be
+    handled specially and placed into their own bucket, for example, if 4 buckets are used, then
+    non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+
+    Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
     :py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed description).
     The precision of the approximation can be controlled with the
     :py:attr:`relativeError` parameter.
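
The NaN handling described in the docstring above can be illustrated with a small pure-Python sketch. This is not Spark's implementation: the function name `assign_buckets`, its arguments, and the hand-written split points are hypothetical and chosen only for illustration. The option names "error", "skip" and "keep" are assumed to mirror the values accepted by `handleInvalid`.

```python
import bisect
import math


def assign_buckets(values, splits, handle_invalid="error"):
    """Map each value to a bucket index (hypothetical helper, not Spark code).

    `splits` is a sorted list of num_buckets + 1 boundaries; bucket i covers
    [splits[i], splits[i + 1]), and the last bucket also includes its upper bound.
    """
    if handle_invalid not in ("error", "skip", "keep"):
        raise ValueError("unknown handle_invalid option: %r" % handle_invalid)

    num_buckets = len(splits) - 1
    result = []
    for v in values:
        if math.isnan(v):
            if handle_invalid == "error":
                raise ValueError("NaN value found in the input")
            if handle_invalid == "skip":
                continue  # drop the NaN entry entirely
            # "keep": NaNs go to one extra bucket after the regular ones
            result.append(num_buckets)
            continue
        # Index of the first boundary strictly greater than v, minus one
        idx = bisect.bisect_right(splits, v) - 1
        # Clamp so a value equal to the top boundary stays in the last bucket
        result.append(min(idx, num_buckets - 1))
    return result


# Four buckets, so regular data lands in 0..3 and NaN (when kept) in 4.
splits = [float("-inf"), 1.0, 2.0, 3.0, float("inf")]
data = [0.5, 1.5, 2.5, 3.5, float("nan")]
print(assign_buckets(data, splits, "keep"))  # [0, 1, 2, 3, 4]
print(assign_buckets(data, splits, "skip"))  # [0, 1, 2, 3]
```

With the default "error" option the same call raises a `ValueError`, which corresponds to the documented default of raising an error when NaN values are found.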


