Don't think so. It took me 2 minutes to do the mapping and nearly 18h to reduce. ~80M points in a 30k vector space.

On 10 April 2012 at 16:18, Praveen Sripati <[email protected]> wrote:
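That 2-minute map versus 18-hour reduce is Amdahl's Law in miniature. As a rough back-of-the-envelope illustration (the function name and numbers are my own, not from the thread):

```python
def amdahl_speedup(sequential_fraction, workers):
    """Maximum speedup when only the non-sequential part scales
    across workers (Amdahl's Law)."""
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / workers)

# 99% sequential (the canopy-reducer situation discussed in this
# thread): adding workers barely helps.
print(round(amdahl_speedup(0.99, 100), 3))   # ~1.01x

# Only 1% sequential: near-linear scaling becomes possible.
print(round(amdahl_speedup(0.01, 100), 3))   # ~50.25x with 100 workers
```

The sequential fraction caps the speedup no matter how many workers are added, which is why an almost entirely sequential reduce dominates the runtime.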
> > If your algorithm has 99% of its time in sequential methods (this is
> > the case in canopy clustering) then you won't get a large speed boost
> > when parallelizing the remaining 1%.
>
> Each mapper calculates the canopies for its subset of the input data and
> sends them to a single reducer. The reducer produces the final canopies
> using the thresholds (t1 and t2) and the distance measure. So the canopy
> generation is parallelized in the mappers, and the reducer shouldn't
> take much time either.
>
> (1) http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
> (2) https://cwiki.apache.org/MAHOUT/canopy-clustering.html
>
> Praveen
>
> On Tue, Apr 10, 2012 at 4:56 PM, Thomas Jungblut <
> [email protected]> wrote:
>
> > KMeans is iterative; you can't do anything about it. Syncs are needed,
> > and they are much more costly in MapReduce than in BSP.
> > If you follow the Mahout mailing list, Ted Dunning has found a paper
> > for a k-means that runs in a single pass, so there is no iteration in
> > it. [1][2]
> >
> > > Also, why is global sync very expensive? Is it because all the
> > > processes have to enter the barrier synchronisation before the next
> > > superstep starts?
> >
> > Yes, and if your data is not distributed evenly or you don't have
> > homogeneous servers it becomes more costly. It is also driven by
> > Amdahl's Law, which simply says that your algorithm can only be as
> > fast as its sequential part. If your algorithm has 99% of its time in
> > sequential methods (this is the case in canopy clustering) then you
> > won't get a large speed boost when parallelizing the remaining 1%.
> > That is also the reason why communication with a master task (or at
> > least an algorithm design with a master task) isn't a good idea
> > either: you're then adding sequential parts and another sync, which
> > makes the whole thing slow.
> >
> > But sometimes you cannot avoid sequential code and a master task.
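For readers new to canopy clustering, the t1/t2 rule mentioned above can be sketched as a single-machine loop. This is my own illustration of the technique, not Mahout's code, and the function names are made up; it corresponds to the sequential step the single reducer performs:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2, distance=euclidean):
    """Sequential sketch of canopy clustering (assumes t2 < t1).

    Points within t1 of a chosen center join its canopy; points
    within t2 are removed from the candidate pool so they cannot
    become centers themselves.
    """
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)  # pick an arbitrary remaining point
        members = [p for p in points if distance(center, p) < t1]
        canopies.append((center, members))
        # Drop everything strongly covered by this canopy.
        candidates = [p for p in candidates if distance(center, p) >= t2]
    return canopies
```

For example, `canopy([(0.0,), (0.5,), (10.0,), (10.4,)], t1=3.0, t2=1.0)` yields two canopies, one around each group of nearby points. Note that the loop over candidates is inherently order-dependent and sequential, which is the sequential part being discussed.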
> > [1] https://github.com/tdunning/knn
> > [2] http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
> >
> > On 10 April 2012 at 12:52, Praveen Sripati <[email protected]> wrote:
> >
> > > It makes sense, since global sync is very expensive.
> > >
> > > For any iterative algorithm like k-means, the # of global syncs is
> > > proportional to the # of iterations. So, how does k-means fit in BSP?
> > >
> > > Also, why is global sync very expensive? Is it because all the
> > > processes have to enter the barrier synchronisation before the next
> > > superstep starts?
> > >
> > > Praveen
> > >
> > > On Tue, Apr 10, 2012 at 12:30 PM, Thomas Jungblut <
> > > [email protected]> wrote:
> > >
> > > > There are algorithms that have very few supersteps; see the
> > > > Matrix-Vector Multiplication in GSoC this year.
> > > > It makes sense, since global sync is very expensive.
> > > >
> > > > However, canopy clustering does not fit very well, since there is
> > > > a parallel part and a sequential part.
> > > > So MapReduce is a good fit for canopy clustering.
> > > >
> > > > On 7 April 2012 at 15:19, Praveen Sripati <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > After Thomas's implementation of K-Means (3) I was motivated to
> > > > > extend it using canopy clustering. So, I started looking at the
> > > > > MR implementation of Canopy, (1) and (2). The MR implementation
> > > > > of canopy clustering is done in two MR phases: the first one to
> > > > > identify the canopies, and the second to assign canopies to the
> > > > > data points. I don't see much improvement when this is done
> > > > > using BSP. Please correct me if I am wrong.
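The point about the number of global syncs being proportional to the number of iterations can be made concrete with a plain, single-machine Lloyd's k-means loop. This is a sketch of the general algorithm, not Hama code; the comments mark where the barrier would sit in a BSP version:

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain (single-machine) Lloyd's k-means over tuples.

    In a BSP version, each pass of the outer loop is one superstep:
    every task assigns its local points, sends partial sums, and then
    enters the global barrier before the new centers are visible --
    so the number of syncs grows with the number of iterations.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: parallelizable across tasks.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Update step: in BSP this happens only after the barrier sync.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(dim) / len(members)
                                         for dim in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        centers = new_centers
    return centers
```

For instance, `kmeans([(0.0,), (1.0,), (10.0,), (11.0,)], k=2)` converges to centers at 0.5 and 10.5. The fixed iteration count here stands in for a convergence check; either way, every iteration costs one global sync in BSP (or a full job launch per iteration in MapReduce, which is why the syncs are so much cheaper in BSP).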
> > > > > Also, are there any algorithms which can be implemented easily
> > > > > (for those who are getting started with Hama/BSP, like me) on
> > > > > Hama/BSP, where we could also see some performance improvements
> > > > > compared to the MR implementation? I have seen Mahout, and there
> > > > > are many algorithms implemented in it; I would like to see
> > > > > something similar in Hama as well.
> > > > >
> > > > > Thanks,
> > > > > Praveen
> > > > >
> > > > > (1) http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
> > > > > (2) https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
> > > > > (3) http://codingwiththomas.blogspot.in/2011/12/k-means-clustering-with-bsp-intuition.html
> > > >
> > > > --
> > > > Thomas Jungblut
> > > > Berlin <[email protected]>
> >
> > --
> > Thomas Jungblut
> > Berlin <[email protected]>
