[ https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184717#comment-15184717 ]
Nick Pentreath commented on SPARK-13600:
----------------------------------------

[~ocp] Could you update this ticket with something about the approach taken of using DataFrame {{approxQuantile}} - perhaps even update the title.

Or it may make sense to create a new JIRA for the change to using {{approxQuantile}}, since it is quite an important change.

> Incorrect number of buckets in QuantileDiscretizer
> --------------------------------------------------
>
>                 Key: SPARK-13600
>                 URL: https://issues.apache.org/jira/browse/SPARK-13600
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>   Affects Versions: 1.6.0, 2.0.0
>            Reporter: Oliver Pierson
>            Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the
> correct splits, resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates
> method.
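
For readers following the arithmetic in the report: an array of n + 1 split points delimits n buckets, so the 7 values quoted above give 6 buckets, and 5 buckets need only 4 interior split points between -Infinity and Infinity. The plain-Scala sketch below illustrates that count. It is not Spark's implementation and not the {{approxQuantile}}-based change mentioned in the comment; the object name BucketCountSketch and the helpers numBuckets and quantileSplits are hypothetical names introduced only for this illustration, and the index rule used to pick split points is an assumption.

```scala
// Illustrative only: plain Scala, no Spark dependency.
// BucketCountSketch, numBuckets and quantileSplits are hypothetical names.
object BucketCountSketch {

  // n + 1 boundaries delimit n buckets.
  def numBuckets(splits: Array[Double]): Int = splits.length - 1

  // Pick (buckets - 1) interior split points from a small in-memory sample,
  // then wrap them in -Infinity / +Infinity. Integer arithmetic is used for
  // the index to avoid floating-point rounding. The index rule is an
  // assumption made for this sketch.
  def quantileSplits(data: Seq[Double], buckets: Int): Array[Double] = {
    require(buckets >= 2 && data.nonEmpty)
    val sorted = data.sorted
    val interior = (1 until buckets).map { i =>
      val idx = math.min(sorted.length - 1, i * sorted.length / buckets)
      sorted(idx)
    }.distinct // duplicate split points would create empty buckets
    Array(Double.NegativeInfinity) ++ interior ++ Array(Double.PositiveInfinity)
  }

  def main(args: Array[String]): Unit = {
    // The splits quoted in the report.
    val reported = Array(Double.NegativeInfinity, 2.0, 4.0, 6.0, 8.0, 10.0,
      Double.PositiveInfinity)
    println(s"reported buckets: ${numBuckets(reported)}")

    // Same data as the report: 1.0 to 10.0 in steps of 1.0.
    val data = (1 to 10).map(_.toDouble)
    val splits = quantileSplits(data, 5)
    println(s"sketch splits: ${splits.mkString(", ")}")
    println(s"sketch buckets: ${numBuckets(splits)}")
  }
}
```

The extra bucket in the report comes from a fifth finite split point (10.0) that coincides with the maximum of the data; with four interior points the count matches setNumBuckets(5).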