Repository: spark
Updated Branches:
  refs/heads/master acf71c63c -> 6eca21ba8

[SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in pyspark

## What changes were proposed in this pull request?
This PR is to document the changes on QuantileDiscretizer in pyspark for PR:
https://github.com/apache/spark/pull/15428

## How was this patch tested?
No test needed

Signed-off-by: VinceShieh <vincent.xieintel.com>

Author: VinceShieh <vincent....@intel.com>

Closes #16922 from VinceShieh/spark-19590.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6eca21ba
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6eca21ba
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6eca21ba

Branch: refs/heads/master
Commit: 6eca21ba881120f1ac7854621380ef8a92972384
Parents: acf71c6
Author: VinceShieh <vincent....@intel.com>
Authored: Wed Feb 15 10:12:07 2017 -0800
Committer: Holden Karau <hol...@us.ibm.com>
Committed: Wed Feb 15 10:12:07 2017 -0800

----------------------------------------------------------------------
 python/pyspark/ml/feature.py | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/6eca21ba/python/pyspark/ml/feature.py
----------------------------------------------------------------------
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index ac90c89..1ab4291 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -1178,7 +1178,17 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
 
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
-    The bin ranges are chosen using an approximate algorithm (see the documentation for
+    It is possible that the number of buckets used will be less than this value, for example, if
+    there are too few distinct values of the input to create enough distinct quantiles.
+
+    NaN handling: Note also that
+    QuantileDiscretizer will raise an error when it finds NaN values in the dataset, but the user
+    can also choose to either keep or remove NaN values within the dataset by setting
+    :py:attr:`handleInvalid` parameter. If the user chooses to keep NaN values, they will be
+    handled specially and placed into their own bucket, for example, if 4 buckets are used, then
+    non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+
+    Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
     :py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed description).
     The precision of the approximation can be controlled with the
     :py:attr:`relativeError` parameter.
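
For readers who want to see the NaN-handling rule from the new docstring in concrete terms, here is a minimal plain-Python sketch of the bucket-assignment semantics. It is not Spark's implementation and does not use pyspark. The helper name `assign_bucket`, its `handle_invalid` argument, and the hand-written `splits` list are hypothetical and exist only for illustration. The half-open interval convention (lower bound inclusive, last bucket also including its upper bound) is an assumption about how the fitted model bins values.

```python
import bisect
import math


def assign_bucket(value, splits, handle_invalid="error"):
    """Return the bucket index for `value`, or None when the row is dropped.

    `splits` is a sorted list of num_buckets + 1 boundaries, for example
    [-inf, 1.0, 2.0, 3.0, inf] for 4 buckets. Hypothetical helper, for
    illustration only.
    """
    num_buckets = len(splits) - 1
    if math.isnan(value):
        if handle_invalid == "keep":
            # NaNs go to one extra bucket placed after the regular ones.
            return num_buckets
        if handle_invalid == "skip":
            # The row would be removed from the output.
            return None
        raise ValueError("NaN value found; use 'skip' or 'keep' to handle it")
    # Assumed convention: bucket i covers [splits[i], splits[i + 1]),
    # and the last bucket also includes its upper bound.
    index = bisect.bisect_right(splits, value) - 1
    return min(index, num_buckets - 1)


if __name__ == "__main__":
    splits = [float("-inf"), 1.0, 2.0, 3.0, float("inf")]
    values = [0.5, 1.0, 2.5, 10.0, float("nan")]
    # With 4 buckets, non-NaN data lands in 0..3 and NaN in bucket 4.
    print([assign_bucket(v, splits, "keep") for v in values])
```

With four buckets and the "keep" setting, the non-NaN values fall into buckets 0 to 3 and the NaN falls into bucket 4, which matches the example given in the docstring. The default setting in this sketch raises an error on NaN, mirroring the behaviour the docstring describes when `handleInvalid` is left unchanged.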