I want to run MLlib's k-means on a big dataset. It seems that for big datasets we need to apply a pre-clustering method such as canopy clustering first. By starting with an initial clustering, the number of more expensive distance measurements can be significantly reduced by ignoring points outside of the initial canopies.
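To make clear what I mean by canopy clustering, here is a minimal sketch of the idea as I understand it. This is plain Python, not MLlib code; the function name `canopy_clustering` and the thresholds `t1` (loose) and `t2` (tight) are my own names for illustration, and I use Euclidean distance where a real implementation would use a cheaper approximate measure.

```python
import math


def canopy_clustering(points, t1, t2):
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the candidate list); t1 must exceed t2.
    Returns a list of (center_index, member_indices) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")

    remaining = list(range(len(points)))  # indices that may still become centers
    canopies = []
    while remaining:
        center = remaining[0]  # a random pick is also common; first keeps it deterministic
        # Every point within t1 of the center joins this canopy.
        members = [i for i in range(len(points))
                   if math.dist(points[center], points[i]) < t1]
        canopies.append((center, members))
        # Points within t2 of the center can no longer start a canopy.
        remaining = [i for i in remaining
                     if math.dist(points[center], points[i]) >= t2]
    return canopies
```

The later k-means step would then only measure the exact distance between a point and the centers that share a canopy with it, which is where the savings are supposed to come from.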
If I am not mistaken, MLlib's k-means offers three initialization methods: k-means++, k-means||, and random. So, can anyone explain whether we can use k-means|| instead of canopy clustering, or do these two methods work completely differently?

Best Regards

.......................................................
Amin Mohebbi
PhD candidate in Software Engineering at university of Malaysia
Tel : +60 18 2040 017
E-Mail : tp025...@ex.apiit.edu.my
amin_...@me.com