GitHub user oliverpierson commented on the pull request:

https://github.com/apache/spark/pull/11402#issuecomment-190372118

After running the test on my machine again, I discovered that it randomly passes or fails. The problem appears to be in [`findSplitCandidates`](https://github.com/oliverpierson/spark/blob/SPARK-13444/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala#L123). Under certain circumstances this method produces `n+1` buckets when only `n` buckets are desired. The new test passes or fails at random because it relies on random sampling of the data to estimate the quantiles. However, the method can also fail deterministically. For example, consider the following:

```
// Assumes a spark-shell session, where `sc` and the DataFrame implicits are already in scope.
import org.apache.spark.ml.feature.QuantileDiscretizer

val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
discretizer.fit(df).getSplits
```

This gives the following splits:

```
Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
```

These seven boundaries correspond to six buckets, although only five were requested.

There are a few ways to fix `findSplitCandidates`. The most straightforward (albeit less elegant) way is to track the number of splits discovered so far while iterating the `while` loop, and to keep looping only while `(index < valueCounts.length && splitsSoFar < numSplits)` holds. I believe this is probably the best option for the bug in `branch-1.6`. If there are no objections, I can put a commit together.

As for the `master` branch, I'm considering rewriting the `findSplitCandidates` method using [the usual method for finding quantiles](https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample). It's done this way in NumPy/SciPy, and I believe it would be at least as fast as the current routine. Does anybody have objections or concerns about such a rewrite?
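To make the capped-loop idea for `branch-1.6` concrete, here is a minimal standalone sketch. It is a simplified stand-in for a stride-based split search, not the actual Spark source: the object `CappedSplitSketch`, the function `findSplits`, and the `capped` flag are hypothetical names introduced only for illustration, and the walk over `valueCounts` (including the trailing sentinel entry) is an assumption about how such a loop might look. Only the loop condition `index < valueCounts.length && splitsSoFar < numSplits` comes from the proposal above.

```scala
import scala.collection.mutable.ArrayBuffer

object CappedSplitSketch {
  // Hypothetical sketch, not the Spark implementation.
  // Returns at most `numSplits` interior split points when `capped` is true.
  def findSplits(samples: Array[Double], numSplits: Int, capped: Boolean = true): Array[Double] = {
    // Distinct values with their counts, ascending, plus a sentinel entry
    // so that the last real value can also be considered as a split.
    val distinct: Array[(Double, Int)] =
      samples.groupBy(identity).map { case (v, vs) => (v, vs.length) }.toArray.sortBy(_._1)
    val valueCounts = distinct :+ ((Double.MaxValue, 1))

    if (distinct.length <= numSplits) {
      // Too few distinct values: every one of them becomes a split.
      distinct.map(_._1)
    } else {
      val stride = math.ceil(samples.length.toDouble / (numSplits + 1))
      val splits = ArrayBuffer.empty[Double]
      var index = 1
      var currentCount = valueCounts(0)._2.toDouble
      var targetCount = stride
      var splitsSoFar = 0
      // The second condition is the proposed fix: stop once enough splits exist.
      while (index < valueCounts.length && (!capped || splitsSoFar < numSplits)) {
        val previousCount = currentCount
        currentCount += valueCounts(index)._2
        // If the previous running count was closer to the target than the
        // current one, the previous value is taken as a split point.
        if (math.abs(previousCount - targetCount) < math.abs(currentCount - targetCount)) {
          splits += valueCounts(index - 1)._1
          splitsSoFar += 1
          targetCount += stride
        }
        index += 1
      }
      splits.toArray
    }
  }
}
```

Under these assumptions, calling `CappedSplitSketch.findSplits((1 to 10).map(_.toDouble).toArray, 4)` returns the four splits `2.0, 4.0, 6.0, 8.0`, while passing `capped = false` lets the walk continue and also emits `10.0`, reproducing the extra bucket from the example above.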