[GitHub] spark issue #14948: [SPARK-17389] [SPARK-3261] [MLLIB] Significant KMeans sp...

srowen Sun, 04 Sep 2016 04:16:45 -0700

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14948
  
    @yanboliang I'm going to close this PR and instead 'port' a few changes I 
made in this version for your consideration. I think yours should be the 
primary PR for removing `runs`. I'll break out the two other changes into 
separate PRs.
    
    You're right that the open question is the number of init steps: more 
steps could make for a better clustering, which could indirectly mean faster 
convergence too. Maybe you can check my work. I think the default of 5 was 
taken from Table 6 in 
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf but the paper 
doesn't say 5 is necessarily an optimal value. In fact, Figures 5.2 and 5.3 
imply that (for l/k=2, as we've chosen here) there's virtually no improvement 
beyond 2 init steps.
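
    To make "init steps" concrete, here is a minimal, unoptimized sketch of 
k-means||-style initialization as I understand it from the paper. It is not 
Spark's implementation; the function and parameter names (`kmeans_parallel_init`, 
`init_steps`, `oversample`) are made up for illustration, and points are assumed 
to be tuples of floats.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two points (tuples of floats).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cost_to_nearest(p, centers):
    # Squared distance from p to its closest center.
    return min(sq_dist(p, c) for c in centers)

def kmeans_parallel_init(points, k, init_steps=2, oversample=None, seed=0):
    """Pick up to k initial centers; hypothetical sketch, not Spark's code."""
    rng = random.Random(seed)
    l = 2 * k if oversample is None else oversample  # l/k = 2 by default
    centers = [rng.choice(points)]                   # one uniform random seed
    for _ in range(init_steps):
        costs = [cost_to_nearest(p, centers) for p in points]
        total = sum(costs)
        if total == 0:
            break  # every point already coincides with a center
        # Each point is sampled independently with probability l * d^2 / cost.
        centers.extend(p for p, c in zip(points, costs)
                       if rng.random() < min(1.0, l * c / total))
    centers = list(dict.fromkeys(centers))  # drop duplicate candidates
    if len(centers) <= k:
        return centers
    # Weight each candidate by the number of points closest to it.
    weights = [0] * len(centers)
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        weights[nearest] += 1
    # Reduce the candidates to k centers with weighted k-means++ seeding.
    chosen = [rng.choices(centers, weights=weights)[0]]
    while len(chosen) < k:
        scores = [w * cost_to_nearest(c, chosen)
                  for c, w in zip(centers, weights)]
        if sum(scores) == 0:
            break
        chosen.append(rng.choices(centers, weights=scores)[0])
    return chosen
```

    Each extra init step is another full pass over the data that adds roughly 
l more candidates, which is why the step count trades init time against 
clustering quality.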
    
    Coupled with the fact that an init step now takes about 5x longer than a 
single iteration, it seems like 5 is pretty expensive as a default too.
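
    As rough arithmetic, taking the ~5x figure above at face value (it is an 
estimate, and the helper name below is made up for illustration):

```python
def init_cost_in_iterations(init_steps, step_cost_ratio=5.0):
    # Init cost expressed in "equivalent k-means iterations", assuming each
    # init step costs step_cost_ratio times as much as a single iteration.
    return init_steps * step_cost_ratio

print(init_cost_in_iterations(2))  # 10.0 iteration-equivalents
print(init_cost_in_iterations(5))  # 25.0 iteration-equivalents
```

    Under that assumption, going from 5 init steps to 2 saves about 15 
iterations' worth of work before the first real iteration runs.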



