[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184717#comment-15184717
 ] 

Nick Pentreath commented on SPARK-13600:


[~ocp] Could you update this ticket with something about the approach taken of 
using DataFrame {{approxQuantile}} - perhaps even update the title. Or it may 
make sense to create a new JIRA for the change to using {{approxQuantile}}, 
since it is quite an important change.

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182512#comment-15182512
 ] 

Apache Spark commented on SPARK-13600:
--

User 'oliverpierson' has created a pull request for this issue:
https://github.com/apache/spark/pull/11553

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-02 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177392#comment-15177392
 ] 

Xusen Yin commented on SPARK-13600:
---

Vote for the new method.

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-02 Thread Oliver Pierson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176580#comment-15176580
 ] 

Oliver Pierson commented on SPARK-13600:


Yeah, you can assign this to me.  However, it may be a few days or even a week 
before I can get a PR together.  I was hoping to get people's opinion on 
reimplementing findSplitCandidates using another 
[method|https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample]
 (see the second paragraph of that section). I believe it's done this way in 
Numpy/Scipy. [~srowen][~mengxr][~yinxusen]

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176117#comment-15176117
 ] 

Nick Pentreath commented on SPARK-13600:


[~ocp] do you plan to submit a PR? Since you worked on SPARK-13444?

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org