[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186648#comment-15186648 ]
Nick Pentreath commented on SPARK-13600:
----------------------------------------

Thanks, that's fine

> Use approxQuantile from DataFrame stats in QuantileDiscretizer
> --------------------------------------------------------------
>
>                 Key: SPARK-13600
>                 URL: https://issues.apache.org/jira/browse/SPARK-13600
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.0, 2.0.0
>            Reporter: Oliver Pierson
>            Assignee: Oliver Pierson
>
> For consistency and code reuse, QuantileDiscretizer should use approxQuantile
> to find splits in the data rather than implement its own method.
> Additionally, making this change should remedy a bug where
> QuantileDiscretizer fails to calculate the correct splits in certain
> circumstances, resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
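
The bucket count in the report follows from the length of the splits array: n split values describe n - 1 buckets, so 5 buckets need exactly 4 interior split points plus the two infinite bounds. The following is a minimal sketch of that arithmetic in plain Scala, without Spark. `SplitSketch` and `splitsFor` are hypothetical names for illustration, and exact quantiles over an in-memory sample stand in for the values that `approxQuantile` would presumably supply; this is not Spark's implementation.

```scala
// Hypothetical helper, not Spark's implementation: derive bucket splits
// for numBuckets buckets from an in-memory sample using exact quantiles.
object SplitSketch {
  def splitsFor(sample: Seq[Double], numBuckets: Int): Array[Double] = {
    require(numBuckets >= 2, "need at least two buckets")
    require(sample.nonEmpty, "sample must not be empty")
    val sorted = sample.sorted.toIndexedSeq
    val n = sorted.length
    // numBuckets buckets need numBuckets - 1 interior split points.
    val interior = (1 until numBuckets).map { i =>
      // Integer ceiling of i * n / numBuckets, used as a 1-based rank.
      val rank = (i * n + numBuckets - 1) / numBuckets
      sorted(math.max(rank - 1, 0))
    }
    // Drop duplicate split points, then add the open-ended outer bounds.
    Array(Double.NegativeInfinity) ++ interior.distinct ++ Array(Double.PositiveInfinity)
  }

  def main(args: Array[String]): Unit = {
    // Same data as in the issue: 1.0, 2.0, ..., 10.0
    val data = (1 to 10).map(_.toDouble)
    val splits = splitsFor(data, 5)
    // A splits array of length n describes n - 1 buckets.
    println(s"splits = ${splits.mkString(", ")}; buckets = ${splits.length - 1}")
  }
}
```

For the data in the issue, this sketch produces the interior splits 2.0, 4.0, 6.0 and 8.0, giving five buckets. The reported output differs only in the extra 10.0, which is the data maximum ending up as a split point.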