[ 
https://issues.apache.org/jira/browse/SPARK-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341634#comment-14341634
 ] 

Sean Owen commented on SPARK-6068:
----------------------------------

Yes, it seems like too much change to the existing version. From 
https://github.com/apache/spark/pull/2634 it seems like there are just some 
differences of opinion about what's worth doing and how. I think the only way 
forward would be to propose integrating what you've done for the new version in 
the {{.ml}} package, because it's not clear the existing PR is going to 
proceed.

I'm hoping to just drive a resolution to what is almost one big issue rather 
than leave it hanging. I'm looking at the ~8 JIRAs for k-means you created:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20reporter%20%3D%20%22Derrick%20Burns%22%20AND%20resolution%20%3D%20Unresolved

I assume a couple (like this one) are 'back-portable' from your work to the 
existing impl. Can we zap those and close them with a PR? This would be great 
and I'd like to help get those quick wins in.

The rest sound like interdependent aspects of one proposal: create a new 
k-means implementation with different design and properties X / Y / Z, and use 
it in the new pipelines API. (I can't say whether this would be accepted or not 
but that's what's on the table). I'd prefer to collect that coherently in one 
place rather than have it live in pieces in JIRA, especially since I'm getting 
the sense these remaining pieces won't otherwise move forward.

> KMeans Parallel test may fail
> -----------------------------
>
>                 Key: SPARK-6068
>                 URL: https://issues.apache.org/jira/browse/SPARK-6068
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.1
>            Reporter: Derrick Burns
>              Labels: clustering
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The test "k-means|| initialization" in KMeansSuite can fail when the random 
> number generator is truly random.
> The test is predicated on the assumption that each round of K-Means || will 
> add at least one new cluster center.  The current implementation of K-Means 
> || adds 2*k cluster centers with high probability.  However, there is no 
> deterministic lower bound on the number of cluster centers added.
> Choices are:
> 1)  change the KMeans || implementation to iterate on selecting points until 
> it has satisfied a lower bound on the number of points chosen.
> 2) eliminate the test
> 3) ignore the problem and depend on the random number generator to sample the 
> space in a lucky manner. 
> Option (1) is most in keeping with the contract that KMeans || should provide 
> a precise number of cluster centers when possible. 
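
For reference, option (1) could look roughly like the single-machine Python 
sketch below. It is not the MLlib code. The names sample_round, min_new and 
max_attempts are hypothetical, and the deterministic fallback (take the 
highest-cost points when resampling keeps coming up short) is an assumption 
of this sketch rather than something proposed in the issue.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cost_to_nearest(point, centers):
    """Squared distance from a point to its closest current center."""
    return min(squared_distance(point, c) for c in centers)


def sample_round(points, centers, oversample, rng, min_new=1, max_attempts=100):
    """One k-means||-style sampling round with a lower bound on new centers.

    Each point is picked independently with probability
    min(1, oversample * cost / total_cost). Plain sampling can return
    nothing, so the round is redrawn until at least min_new points are picked.
    """
    costs = [cost_to_nearest(p, centers) for p in points]
    total = sum(costs)
    if total == 0.0:
        return []  # every point already coincides with a center

    # The bound cannot exceed the number of points that are not yet centers.
    target = min(min_new, sum(1 for c in costs if c > 0.0))

    for _ in range(max_attempts):
        chosen = [p for p, c in zip(points, costs)
                  if rng.random() < min(1.0, oversample * c / total)]
        if len(chosen) >= target:
            return chosen

    # Assumed fallback: take the highest-cost points so the bound always holds.
    ranked = sorted(zip(costs, range(len(points))), reverse=True)
    return [points[i] for _, i in ranked[:target]]
```

With a bound like this, a test can assert that each round adds at least 
min_new centers regardless of the random seed.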



