[ 
https://issues.apache.org/jira/browse/SPARK-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341326#comment-14341326
 ] 

Sean Owen commented on SPARK-6068:
----------------------------------

Is this something for which a PR can easily be created, then? It sounds like 
you're saying you have fixed it in your copy, and that bit still resembles the 
original code here. Or if you'll point me at it, I can try to extract the change.

On a broader question, IMHO:

I think you are ultimately creating a fairly different implementation in order 
to get in your improvements, and it's quite hard to propose a radical change to 
the implementation here. Especially if it changes the API or behaviors people 
are using. It's a shame that any library can only reasonably contain one 
implementation of a thing, but of course, nobody said MLlib is supposed to have 
everything or every bell and whistle, and we can and should be able to drop in 
other implementations in a Spark program as we like.

Overhauls are possible at inflection points in a project lifecycle, and note 
that the 'pipelines' API rewrite is going on now. I don't have any view on how 
realistic it is to drop in your work as the new implementation there.

Failing that, I wonder if a lot of the improvements you've suggested here 
require a substantial rewrite, and won't realistically happen on the current 
implementation? For those, I might suggest withdrawing the existing PRs / JIRAs 
and instead leaving one placeholder JIRA summarizing the key features you'd 
like to see in a future rewrite, tracked as a feature request. It's up to your 
judgment, but I suggest it since you say you abandoned the open PR.

It would still be good to get in any smaller clear-win changes that can be 
'back-ported'.

> KMeans Parallel test may fail
> -----------------------------
>
>                 Key: SPARK-6068
>                 URL: https://issues.apache.org/jira/browse/SPARK-6068
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.1
>            Reporter: Derrick Burns
>              Labels: clustering
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The test "k-means|| initialization" in KMeansSuite can fail when the random 
> number generator is truly random.
> The test is predicated on the assumption that each round of K-Means || will 
> add at least one new cluster center.  The current implementation of K-Means 
> || adds 2*k cluster centers with high probability.  However, there is no 
> deterministic lower bound on the number of cluster centers added.
> Choices are:
> 1)  change the KMeans || implementation to iterate on selecting points until 
> it has satisfied a lower bound on the number of points chosen.
> 2) eliminate the test
> 3) ignore the problem and depend on the random number generator to sample the 
> space in a lucky manner. 
> Option (1) is most in keeping with the contract that KMeans || should provide 
> a precise number of cluster centers when possible. 
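
For illustration only, option (1) from the description above could be sketched 
roughly as follows. This is plain Python, not the MLlib code; the names 
sample_round, min_new, max_tries and rand are hypothetical, and the 2*k 
oversampling factor is simply taken from the description. A round is resampled 
until it yields at least min_new points, with a deterministic fallback so the 
lower bound holds even for an unlucky random sequence.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sample_round(points, centers, k, min_new=1, max_tries=20,
                 rand=random.random):
    """One k-means||-style sampling round with a lower bound (hypothetical).

    Returns at least `min_new` new candidate centers as long as that many
    points lie away from the current centers.
    """
    # Cost of a point = squared distance to its nearest current center.
    costs = [min(sq_dist(p, c) for c in centers) for p in points]
    total = sum(costs)
    if total == 0.0:
        return []  # every point already coincides with a center

    # Each point is kept independently with probability 2*k*cost/total
    # (capped at 1), so a single pass can legitimately select nothing.
    for _ in range(max_tries):
        chosen = [p for p, c in zip(points, costs)
                  if rand() < min(1.0, 2.0 * k * c / total)]
        if len(chosen) >= min_new:
            return chosen

    # Deterministic fallback: take the highest-cost points so the lower
    # bound holds regardless of the random sequence.
    order = sorted(range(len(points)), key=lambda i: costs[i], reverse=True)
    return [points[i] for i in order[:min_new] if costs[i] > 0.0]
```

With a bound like this, a test could assert that every round adds at least one 
center without depending on a particular seed. The trade-off is that resampled 
rounds no longer follow exactly the distribution of a single unconditioned 
pass.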



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
