[ 
https://issues.apache.org/jira/browse/SPARK-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341316#comment-14341316
 ] 

Derrick Burns commented on SPARK-6068:
--------------------------------------

Not theoretical. The unit test failed for me in my project (a branch of Spark 
1.1 mllib). I spent 20 minutes stepping through the code before I realized that 
the problem was not a new one in my implementation of KMeansParallel but a 
pre-existing one. I confirmed by visual inspection that the problem still 
exists in a recent branch of Spark.

In this case the fix is simple: during k-means parallel, maintain a count of 
the number of centers actually added for each run. Terminate the loop that adds 
2k points per iteration only when both the step limit has been reached AND at 
least k points have been made centers.

With this change, the high-probability case will be unaffected (i.e., no extra 
iterations), while the low-probability case that gets hit in the unit test will 
be covered as well.
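
A minimal sketch of the termination rule described above, written as a 
standalone Python simulation and not taken from the Spark MLlib source. The 
function name `kmeans_parallel_init`, its parameters, the per-point sampling 
probability of min(1, 2k * cost / total), and the zero-cost guard are all 
assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_parallel_init(points, k, init_steps, rng):
    """Oversampling loop of a k-means||-style initialization (illustrative only).

    Returns the candidate centers and the number of rounds executed.
    """
    centers = [rng.choice(points)]
    step = 0
    # Stop only when BOTH conditions hold: the step limit has been reached
    # AND at least k centers have been chosen.
    while step < init_steps or len(centers) < k:
        costs = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(costs)
        if total == 0.0:
            # Assumed guard: every point already coincides with a center,
            # so no further center can be added and the loop must end.
            break
        # Each point is sampled independently, so a round may add zero
        # centers; that is the low-probability case the plain step limit misses.
        chosen = [p for p, cost in zip(points, costs)
                  if rng.random() < 2.0 * k * cost / total]
        centers.extend(chosen)
        step += 1
    return centers, step
```

In the common case the loop exits after exactly `init_steps` rounds, because 
at least k centers have already been collected by then. Extra rounds run only 
when sampling was unlucky.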

I've abandoned the "big bang multi-JIRA PR" that I submitted. It was too much 
at one time, as you point out. However, that is what the Spark clusterer 
requires in order to offer the flexibility that I demonstrate in my branch. 
I put a lot of time into the re-architecture, including many hours of 
large-scale testing. Perhaps others can benefit from that by forking or using 
an upcoming release of my branch.




> KMeans Parallel test may fail
> -----------------------------
>
>                 Key: SPARK-6068
>                 URL: https://issues.apache.org/jira/browse/SPARK-6068
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.1
>            Reporter: Derrick Burns
>              Labels: clustering
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The test "k-means|| initialization" in KMeansSuite can fail when the random 
> number generator is truly random.
> The test is predicated on the assumption that each round of K-Means || will 
> add at least one new cluster center.  The current implementation of K-Means 
> || adds 2*k cluster centers with high probability.  However, there is no 
> deterministic lower bound on the number of cluster centers added.
> Choices are:
> 1)  change the KMeans || implementation to iterate on selecting points until 
> it has satisfied a lower bound on the number of points chosen.
> 2) eliminate the test
> 3) ignore the problem and depend on the random number generator to sample the 
> space in a lucky manner. 
> Option (1) is most in keeping with the contract that KMeans || should provide 
> a precise number of cluster centers when possible. 



