Yes, both run in parallel. Random is the baseline initialization: it picks initial centers uniformly at random, which may miss small clusters. k-means++ improves on random initialization by weighting points by their distance from the centers chosen so far, so distant points are more likely to be picked. You can view k-means|| as a more scalable version of k-means++. We don't expose k-means++ as an initialization mode, but we use it as a subroutine inside k-means||. Please check the papers for more details.

-Xiangrui
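For intuition, the k-means++ weighting step can be sketched in a few lines of plain Python. This is only an illustration of the idea (pick each new center with probability proportional to its squared distance from the nearest existing center), not MLlib's implementation; the function name and data are made up for the example:

```python
import random

def kmeans_pp_init(points, k, rng=random.Random(0)):
    """Pick k initial centers: the first uniformly at random, each
    subsequent one with probability proportional to its squared
    distance to the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sum((p[i] - c[i]) ** 2 for i in range(len(p)))
                  for c in centers)
              for p in points]
        total = sum(d2)
        if total == 0:  # every point already coincides with a center
            centers.append(rng.choice(points))
            continue
        # Weighted sampling: walk the cumulative weights until we pass r.
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

# Two tight clusters far apart; k-means++ tends to seed one center in each,
# whereas purely random seeding can easily put both in the same cluster.
pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
print(kmeans_pp_init(pts, 2))
```

k-means|| follows the same distance-weighting idea but, instead of sampling one center per pass, oversamples many candidates per pass in parallel and then runs k-means++ on that small candidate set.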
On Wed, Jul 16, 2014 at 10:27 PM, amin mohebbi <aminn_...@yahoo.com> wrote:
> Thank you for the response. Can we say that both implementations
> compute the centroids in parallel? I mean, in both cases will the data
> and code be sent to the workers and the results collected and passed
> back to the driver? And why do we have three types of initialization
> in MLlib?
>
> Initialization:
> • random
> • k-means++
> • k-means||
>
> Best Regards
> .......................................................
> Amin Mohebbi
> PhD candidate in Software Engineering
> at University of Malaysia
> H/P : +60 18 2040 017
> E-Mail : tp025...@ex.apiit.edu.my
> amin_...@me.com
>
> On Thursday, July 17, 2014 11:57 AM, Xiangrui Meng <men...@gmail.com> wrote:
>
> kmeans.py contains a naive implementation of k-means in Python, serving
> as an example of how to use PySpark. Please use MLlib's implementation
> in practice. There is a JIRA for making this clear:
> https://issues.apache.org/jira/browse/SPARK-2434
>
> -Xiangrui
>
> On Wed, Jul 16, 2014 at 8:16 PM, amin mohebbi <aminn_...@yahoo.com> wrote:
>> Can anyone explain to me the difference between the k-means in MLlib
>> and the k-means in examples/src/main/python/kmeans.py?
>>
>> Best Regards
>> Amin Mohebbi