[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

Nick Pentreath (JIRA) Tue, 08 Mar 2016 01:35:30 -0800

    [ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184717#comment-15184717 ]


Nick Pentreath commented on SPARK-13600:
----------------------------------------

[~ocp] Could you update this ticket to describe the approach taken, namely 
using DataFrame {{approxQuantile}}, and perhaps update the title as well? 
Alternatively, it may make sense to create a new JIRA for the switch to 
{{approxQuantile}}, since it is quite an important change.
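
For readers unfamiliar with {{approxQuantile}}, the following is a minimal sketch of how splits could be derived from it. It is not the code from the patch discussed here. The object name and the helpers toSplits and splitsFor are hypothetical, and the sketch assumes a Spark 2.x environment where SparkSession and DataFrame.stat.approxQuantile(col, probabilities, relativeError) are available.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ApproxQuantileSplits {
  // Hypothetical helper (not part of Spark): wraps the interior quantile
  // estimates with infinite outer bounds and drops duplicates so that the
  // boundaries are strictly increasing.
  def toSplits(interior: Seq[Double]): Array[Double] =
    (Double.NegativeInfinity +: interior.distinct.sorted :+ Double.PositiveInfinity).toArray

  // Hypothetical helper: requests the numBuckets - 1 interior quantiles of
  // `col` and turns them into a splits array.
  def splitsFor(df: DataFrame, col: String, numBuckets: Int, relativeError: Double): Array[Double] = {
    val probabilities = (1 until numBuckets).map(_.toDouble / numBuckets).toArray
    toSplits(df.stat.approxQuantile(col, probabilities, relativeError))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("splits-sketch").getOrCreate()
    import spark.implicits._

    // Same data as in the issue description: 1.0, 2.0, ..., 10.0.
    val df = (1 to 10).map(_.toDouble).toDF("x")
    val splits = splitsFor(df, "x", numBuckets = 5, relativeError = 0.001)

    println(splits.mkString("Array(", ", ", ")"))
    println(s"buckets: ${splits.length - 1}")
    spark.stop()
  }
}
```

Because duplicates are dropped, this sketch can return fewer than numBuckets buckets on data with many repeated values, but never more.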

> Incorrect number of buckets in QuantileDiscretizer
> --------------------------------------------------
>
>                 Key: SPARK-13600
>                 URL: https://issues.apache.org/jira/browse/SPARK-13600
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.0, 2.0.0
>            Reporter: Oliver Pierson
>            Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits, resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.
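
The bucket count in the report follows from the length of the splits array: n boundaries define n - 1 buckets, so 5 buckets need exactly 6 boundaries. The plain-Scala sketch below (no Spark required) checks the reported array against that rule. The object and function names are illustrative and do not come from the Spark code base.

```scala
object BucketCount {
  // A splits array with n boundaries defines n - 1 buckets.
  def numBuckets(splits: Array[Double]): Int = splits.length - 1

  // Number of boundaries (including both infinite ends) needed for a
  // requested bucket count.
  def requiredBoundaries(buckets: Int): Int = buckets + 1

  def main(args: Array[String]): Unit = {
    // The splits array reported in the issue for setNumBuckets(5).
    val reported = Array(Double.NegativeInfinity, 2.0, 4.0, 6.0, 8.0, 10.0,
      Double.PositiveInfinity)

    println(s"reported boundaries: ${reported.length}")       // 7
    println(s"reported buckets:    ${numBuckets(reported)}")  // 6
    println(s"boundaries needed for 5 buckets: ${requiredBoundaries(5)}")  // 6
  }
}
```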



