Github user oliverpierson commented on the pull request:

    https://github.com/apache/spark/pull/11402#issuecomment-190372118
  
    After running the test on my machine again, I discovered that it randomly 
passes or fails.  It appears that the problem is in 
[`findSplitCandidates`](https://github.com/oliverpierson/spark/blob/SPARK-13444/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala#L123).
  This method can return splits for `n+1` buckets under certain circumstances 
when only `n` buckets are desired.  The new test passes or fails randomly 
because it estimates the quantiles from a random sample of the data.  
    
    However, the method can still fail deterministically.  For example, 
consider the following:
    ```
    val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
    val discretizer = new QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
    discretizer.fit(df).getSplits
    ```
    This gives the following splits:
    ```
    Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
    ```
    which corresponds to six buckets.
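
    To make the bucket count concrete, here is a small sketch that counts the 
buckets implied by the splits above and shows that the values 1.0 through 10.0 
really land in six distinct buckets.  The `bucketOf` helper is hypothetical, 
and the half-open `[lo, hi)` interval convention is an assumption made for 
illustration.

```scala
object BucketCountSketch {
  // The splits reported above, including the two infinite end points.
  val splits: Array[Double] =
    Array(Double.NegativeInfinity, 2.0, 4.0, 6.0, 8.0, 10.0, Double.PositiveInfinity)

  // k + 1 boundaries delimit k buckets.
  val numBuckets: Int = splits.length - 1

  // Hypothetical helper: index of the bucket containing x, assuming each
  // bucket is the half-open interval [lo, hi).  The result is clamped so
  // that +Infinity falls into the last bucket.
  def bucketOf(x: Double): Int =
    math.min(splits.lastIndexWhere(_ <= x), numBuckets - 1)
}
```

    With these splits, `numBuckets` is 6 rather than the requested 5, and the 
value 10.0 ends up alone in the extra bucket.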
    
    There are a few ways to fix `findSplitCandidates`.  The most 
straightforward (albeit less elegant) way is to track the number of splits 
discovered so far while iterating the `while` loop, and to keep looping only 
while `(index < valueCounts.length && splitsSoFar < numSplits)`.  I believe 
this is probably the best option for the bug in `branch-1.6`.  If there are no 
objections, I can put a commit together.
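
    A minimal sketch of the bounded loop follows.  It is a simplified stand-in, 
not the actual Spark source: only `index`, `valueCounts`, `splitsSoFar` and 
`numSplits` come from the description above, while the stride-based search, 
the sentinel entry and the function name `findSplitCandidatesBounded` are 
assumptions made so the example is self-contained.

```scala
import scala.collection.mutable.ArrayBuffer

object BoundedSplitsSketch {
  // Hypothetical stand-in for the split search; not the actual Spark code.
  def findSplitCandidatesBounded(samples: Array[Double], numSplits: Int): Array[Double] = {
    // Distinct values with their counts, sorted by value, followed by a
    // sentinel entry so the largest real value can still become a split.
    val valueCounts: Array[(Double, Int)] =
      samples.groupBy(identity).map { case (v, xs) => (v, xs.length) }.toArray.sortBy(_._1) :+
        ((Double.MaxValue, 1))

    if (valueCounts.length - 1 <= numSplits) {
      // Too few distinct values: every distinct value is a split.
      valueCounts.dropRight(1).map(_._1)
    } else {
      // Assumed target number of samples per bucket.
      val stride = math.ceil(samples.length.toDouble / (numSplits + 1))
      val splits = ArrayBuffer.empty[Double]
      var index = 1
      var currentCount = valueCounts(0)._2
      var targetCount = stride
      var splitsSoFar = 0
      // The extra `splitsSoFar < numSplits` condition is the proposed fix:
      // stop as soon as the requested number of splits has been found.
      while (index < valueCounts.length && splitsSoFar < numSplits) {
        val previousCount = currentCount
        currentCount += valueCounts(index)._2
        // The previous value is a split if adding the current value's count
        // moves the running total further away from the target.
        if (math.abs(previousCount - targetCount) < math.abs(currentCount - targetCount)) {
          splits += valueCounts(index - 1)._1
          splitsSoFar += 1
          targetCount += stride
        }
        index += 1
      }
      splits.toArray
    }
  }
}
```

    In this sketch, the samples 1.0 through 10.0 with `numSplits = 4` (five 
buckets) give the interior splits 2.0, 4.0, 6.0 and 8.0.  Without the 
`splitsSoFar` check, the same loop would also emit 10.0.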
    
    As for the `master` branch, I'm considering rewriting the 
`findSplitCandidates` method using [the usual method for finding 
quantiles](https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample).
  It's done this way in NumPy/SciPy and I believe it would be at least as fast 
as the current routine.  Does anybody have any objections or concerns about 
such a rewrite?
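
    For illustration, here is a sketch of one such estimator: linear 
interpolation between the two closest order statistics, which is NumPy's 
default for percentiles.  The name `quantileSplits` and its signature are 
hypothetical, and this is only one of several estimators listed on the linked 
page, not necessarily the one a rewrite would end up using.

```scala
object LinearQuantileSketch {
  // Hypothetical helper: returns the numBuckets - 1 interior split points,
  // i.e. the k/numBuckets quantiles for k = 1 .. numBuckets - 1.
  def quantileSplits(samples: Array[Double], numBuckets: Int): Array[Double] = {
    require(samples.nonEmpty && numBuckets >= 1, "need samples and at least one bucket")
    val sorted = samples.sorted
    val n = sorted.length
    (1 until numBuckets).map { k =>
      // Fractional position of the k/numBuckets quantile in the sorted sample.
      val h = (n - 1) * k.toDouble / numBuckets
      val lo = math.floor(h).toInt
      val hi = math.min(lo + 1, n - 1)
      // Interpolate linearly between the two neighbouring order statistics.
      sorted(lo) + (h - lo) * (sorted(hi) - sorted(lo))
    }.toArray
  }
}
```

    By construction this always returns exactly `numBuckets - 1` interior 
splits.  For the samples 1.0 through 10.0 and five buckets they are 
approximately 2.8, 4.6, 6.4 and 8.2.  Heavily tied data can yield repeated 
split values, which a real implementation would need to deduplicate.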


