GitHub user srowen commented on the issue: https://github.com/apache/spark/pull/14948

@yanboliang I'm going to close this PR and instead 'port' a few changes I made in this version for your consideration. I think yours should be the primary PR for removing `runs`. I'll break out the two other changes into separate PRs.

You're right that this is the question regarding init steps: more steps could make for a better clustering, which could indirectly mean faster convergence too. Maybe you can check my work.

I think the default of 5 was taken from Table 6 in http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf but the paper doesn't say that 5 is necessarily an optimal value. In fact, Figures 5.2 and 5.3 imply that (for l/k=2, as we've chosen here) there's virtually no improvement beyond 2 init steps. Coupled with the fact that an init step now takes about 5x longer than a single iteration, 5 seems pretty expensive as a default too.
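For anyone who wants to sanity-check the init-steps question on toy data, here is a minimal pure-Python sketch of the k-means|| oversampling rounds with a configurable number of steps. It is not Spark's implementation: the names `kmeans_parallel_init`, `cost_to_centers`, `init_steps` and `oversample` are hypothetical, it assumes l = 2k as discussed above, and it stops at the candidate set rather than reclustering the candidates down to k centers. Taking the 5x figure at face value, 5 init steps would cost roughly as much as 25 iterations, versus about 10 for 2 steps.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost_to_centers(points, centers):
    """Squared distance from each point to its nearest center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def kmeans_parallel_init(points, k, init_steps=2, oversample=None, seed=0):
    """Run `init_steps` oversampling rounds and return the candidate centers.

    Sketch only: the candidates are NOT reduced to k centers here.
    """
    rng = random.Random(seed)
    # Oversampling factor l; assumed to be 2k (l/k = 2) unless overridden.
    l = oversample if oversample is not None else 2 * k
    # Start from a single point chosen uniformly at random.
    centers = [rng.choice(points)]
    for _ in range(init_steps):
        costs = cost_to_centers(points, centers)
        total = sum(costs)
        if total == 0:
            break  # every point already coincides with a candidate
        # Sample each point independently with probability
        # min(1, l * d^2(point, centers) / total cost).
        sampled = [
            p for p, c in zip(points, costs)
            if rng.random() < min(1.0, l * c / total)
        ]
        centers.extend(sampled)
    return centers
```

Calling this with the same seed and `init_steps=2` versus `init_steps=5`, then comparing `sum(cost_to_centers(points, centers))` for the two results, gives a rough feel for how much the extra rounds reduce the initial cost on a given dataset.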