K-means is iterative; you can't do anything about that. Syncs are needed, and in MapReduce they are much more costly than in BSP: every iteration is a new job that pays the startup cost and writes its results back to HDFS, while in BSP the tasks stay alive and only pass a barrier between supersteps. If you follow the Mahout mailing list, Ted Dunning has found a paper for a k-means that runs in a single pass, so there is no iteration in there at all. [1][2]
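To make that concrete, here is a rough sketch of how the iterative part could look as a Hama BSP job, with one barrier per iteration instead of one MapReduce job per iteration. Treat it as pseudocode: the CenterMessage type and the assign/merge/recompute helpers are placeholders for illustration, not Hama API, and the signatures are simplified.

import java.io.IOException;

import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class KMeansSketchBSP extends BSP {

  public void bsp(BSPPeer peer)
      throws IOException, SyncException, InterruptedException {
    boolean converged = false;
    while (!converged) {
      // Parallel part: assign the local points to their nearest center
      // and accumulate per-center sums and counts, purely locally.
      CenterMessage partial = assignLocalPoints();

      // All-to-all broadcast of the partial sums; no master task, so no
      // extra sequential step is introduced.
      for (String other : peer.getAllPeerNames()) {
        peer.send(other, partial);
      }

      // The global barrier: every peer waits here until all peers have
      // arrived, then the next superstep begins. One sync per iteration.
      peer.sync();

      // Every peer merges the same partial sums and recomputes the same
      // centers, so all peers agree on convergence and leave the loop in
      // the same superstep (a BSP requirement).
      CenterMessage msg;
      while ((msg = (CenterMessage) peer.getCurrentMessage()) != null) {
        mergePartialSums(msg);
      }
      converged = recomputeCentersAndCheckConvergence();
    }
  }

  // Placeholders: in a real job these would work on the assigned input
  // split and a shared list of the current centers.
  private CenterMessage assignLocalPoints() { /* ... */ return null; }
  private void mergePartialSums(CenterMessage msg) { /* ... */ }
  private boolean recomputeCentersAndCheckConvergence() { /* ... */ return true; }
}

The point is that the peers stay alive across the whole loop, so the per-iteration cost is only the message exchange plus the barrier, instead of a full job startup and an HDFS round-trip as in MapReduce.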
> Also, why is global sync very expensive? Is it because all the processes
> have to enter the barrier synchronisation before the next superstep
> starts?

Yes. Every peer has to reach the barrier before the next superstep can start, so the barrier costs as much as the slowest peer needs to get there: if your data is not distributed evenly, or your servers are not homogeneous, it becomes more costly.

It is also driven by Amdahl's Law, which simply says that your algorithm can only be as fast as its sequential part allows: with a sequential fraction s, the speedup on n processors is bounded by 1 / (s + (1 - s) / n). If your algorithm spends 99% of its time in sequential methods (this is the case in canopy clustering), then even with unlimited processors the speedup is capped at 1 / 0.99, about 1.01x, so you won't get a large boost from parallelizing the remaining 1%. That is also the reason why communication with a master task (or at least an algorithm design with a master task) isn't a good idea: you are adding sequential parts and another sync, which makes the whole thing slow. But sometimes you cannot avoid sequential code and a master task.

[1] https://github.com/tdunning/knn
[2] http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf

On 10 April 2012 12:52, Praveen Sripati <[email protected]> wrote:

>
> It makes sense, since global sync is very expensive.
>
> For any iterative algorithm like k-means, the # of global syncs is
> proportional to the # of iterations. So, how does k-means fit in BSP?
>
> Also, why is global sync very expensive? Is it because all the processes
> have to enter the barrier synchronisation before the next superstep
> starts?
>
> Praveen
>
> On Tue, Apr 10, 2012 at 12:30 PM, Thomas Jungblut <
> [email protected]> wrote:
>
> > There are algorithms that have very few supersteps, see the Matrix-Vector
> > Multiplication in GSoC this year. It makes sense, since global sync is
> > very expensive.
> >
> > However, Canopy clustering does not fit very well, since there is a
> > parallel part and a sequential part. So MapReduce is a good fit for
> > canopy clustering.
> >
> > On 7 April 2012 15:19, Praveen Sripati <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > After Thomas' implementation of k-means (3) I was motivated to extend
> > > it using Canopy clustering, so I started looking at the MR
> > > implementation of Canopy (1) and (2). The MR implementation of Canopy
> > > clustering is done in two MR phases, the first one to identify the
> > > canopies and the second to assign canopies to the data points. I don't
> > > see much improvement when this is done using BSP. Please correct me if
> > > I am wrong.
> > >
> > > Also, are there any algorithms which can be implemented easily (for
> > > those who are getting started with Hama/BSP like me) on Hama/BSP where
> > > we could also see some performance improvement compared to the MR
> > > implementation? I have seen Mahout, which has many algorithms
> > > implemented, and I would like to see something similar in Hama.
> > >
> > > Thanks,
> > > Praveen
> > >
> > > (1) -
> > > http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
> > > (2) -
> > > https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
> > > (3) -
> > > http://codingwiththomas.blogspot.in/2011/12/k-means-clustering-with-bsp-intuition.html
> >
> > --
> > Thomas Jungblut
> > Berlin <[email protected]>
>

--
Thomas Jungblut
Berlin <[email protected]>
