[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184717#comment-15184717 ]

Nick Pentreath commented on SPARK-13600:

[~ocp] Could you update this ticket with something about the approach taken of using DataFrame {{approxQuantile}}, and perhaps even update the title? Or it may make sense to create a new JIRA for the change to using {{approxQuantile}}, since it is quite an important change.

> Incorrect number of buckets in QuantileDiscretizer
> --------------------------------------------------
>
>                 Key: SPARK-13600
>                 URL: https://issues.apache.org/jira/browse/SPARK-13600
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.0, 2.0.0
>            Reporter: Oliver Pierson
>            Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the
> correct splits, resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitCandidates
> method.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182512#comment-15182512 ]

Apache Spark commented on SPARK-13600:

User 'oliverpierson' has created a pull request for this issue:
https://github.com/apache/spark/pull/11553
[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177392#comment-15177392 ]

Xusen Yin commented on SPARK-13600:

Vote for the new method.
[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176580#comment-15176580 ]

Oliver Pierson commented on SPARK-13600:

Yeah, you can assign this to me. However, it may be a few days or even a week before I can get a PR together. I was hoping to get people's opinion on reimplementing findSplitCandidates using another [method|https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample] (see the second paragraph of that section). I believe it's done this way in NumPy/SciPy. [~srowen] [~mengxr] [~yinxusen]
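For illustration only, here is a minimal sketch of the interpolation-based estimator from the linked section, assuming the NumPy-style linear rule (fractional position q * (n - 1) in the sorted sample). This is plain Scala, not Spark code: the object name and the helpers {{quantile}} and {{splits}} are hypothetical, and the change that actually lands may look quite different.

```scala
object QuantileSplitsSketch {
  /** Linear-interpolation quantile of a sorted, non-empty array, for 0 <= q <= 1. */
  def quantile(sorted: Array[Double], q: Double): Double = {
    val pos = q * (sorted.length - 1)             // fractional index into the sample
    val lo = math.floor(pos).toInt
    val hi = math.min(lo + 1, sorted.length - 1)  // clamp at the last element
    sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo))
  }

  /** Hypothetical helper: numBuckets - 1 interior splits, padded with infinities. */
  def splits(samples: Array[Double], numBuckets: Int): Array[Double] = {
    val sorted = samples.sorted
    // Duplicate quantile values (heavily repeated data) are collapsed, which
    // can yield fewer buckets than requested but never more.
    val interior = (1 until numBuckets)
      .map(i => quantile(sorted, i.toDouble / numBuckets))
      .distinct
    Array(Double.NegativeInfinity) ++ interior ++ Array(Double.PositiveInfinity)
  }

  def main(args: Array[String]): Unit = {
    // Splits reported in this ticket: 7 boundaries, hence 6 buckets.
    val reported = Array(Double.NegativeInfinity, 2.0, 4.0, 6.0, 8.0, 10.0,
      Double.PositiveInfinity)
    println(s"reported buckets = ${reported.length - 1}")

    // Same data as the example in the ticket: 1.0 to 10.0 by 1.0.
    val data = (1 to 10).map(_.toDouble).toArray
    val s = splits(data, 5)
    println(s.mkString("Array(", ", ", ")"))
    println(s"sketch buckets = ${s.length - 1}")
  }
}
```

On the 1.0 to 10.0 example this gives interior splits near 2.8, 4.6, 6.4 and 8.2, i.e. six boundaries and the requested 5 buckets. The point is that asking for exactly numBuckets - 1 interior quantiles bounds the bucket count by construction.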
[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176117#comment-15176117 ]

Nick Pentreath commented on SPARK-13600:

[~ocp] Do you plan to submit a PR, since you worked on SPARK-13444?