A single reducer gets the partial centroids from all the mappers and
computes the final centroids from them. So, the reducer shouldn't take much time.
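A minimal sketch of what that reduce step computes, assuming each mapper emits, per cluster id, the sum of its local points and a point count (names and data shapes here are illustrative, not the Mahout/Hama API):

```python
# Sketch: combining per-mapper partial centroids into final centroids.
# Assumes each mapper emits (sum_vector, count) per cluster id.

def reduce_centroids(partials):
    """partials: {cluster_id: [(sum_vector, count), ...]} from all mappers."""
    centroids = {}
    for cid, contributions in partials.items():
        total_count = sum(c for _, c in contributions)
        dims = len(contributions[0][0])
        total_sum = [sum(vec[d] for vec, _ in contributions) for d in range(dims)]
        centroids[cid] = [s / total_count for s in total_sum]
    return centroids
```

The reduce work is linear in the number of mappers times the number of clusters, which is why it should be cheap relative to the map phase.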

On Tue, Apr 10, 2012 at 7:51 PM, Thomas Jungblut <
[email protected]> wrote:

> Don't think so. It took me 2 minutes to do the mapping and nearly 18h to
> reduce. ~80M points in a 30k vector space.
> On 10.04.2012 at 16:18, "Praveen Sripati" <[email protected]> wrote:
>
> > > If your algorithm has 99% of time in sequential methods (this is the
> > > case in canopy clustering) then you won't get a large speed boost when
> > > parallelizing the remaining 1%.
> >
> > Each mapper calculates the canopies for its subset of the input data and
> > sends them to a single reducer. The reducer produces the final canopies
> > using the thresholds (t1 and t2) and the distance measure. So, the canopy
> > generation is parallelized in the mappers, and the reducer shouldn't take
> > much time either.
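For reference, the canopy generation the thread is discussing follows the classic greedy scheme with a loose threshold t1 and a tight threshold t2. This is a single-process sketch of that scheme, not the Mahout implementation:

```python
import math

# Illustrative sketch of greedy canopy creation with thresholds t1 > t2.
# Not the Mahout implementation; just the algorithm the thread refers to.

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_canopies(points, t1, t2):
    """t1 = loose membership radius, t2 = removal radius (t1 > t2)."""
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)          # pick an arbitrary remaining point
        members = [center]
        remaining = []
        for p in candidates:
            d = distance(center, p)
            if d < t1:
                members.append(p)           # loosely belongs to this canopy
            if d >= t2:
                remaining.append(p)         # still free to seed/join others
        candidates = remaining
        canopies.append((center, members))
    return canopies
```

In the MR version each mapper runs this over its split, and the reducer runs it again over the mapper outputs to merge overlapping canopies.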
> >
> > (1) http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
> > (2) https://cwiki.apache.org/MAHOUT/canopy-clustering.html
> >
> > Praveen
> >
> > On Tue, Apr 10, 2012 at 4:56 PM, Thomas Jungblut <
> > [email protected]> wrote:
> >
> > > KMeans is iterative, you can't do anything about it. Syncs are needed;
> > > they are much more costly in MapReduce than in BSP.
> > > If you follow the Mahout mailing list, Ted Dunning has found a paper
> > > for a kmeans that runs in a single pass, so there is no iteration in
> > > there. [1][2]
> > >
> > > > Also, why is global sync very expensive? Is it because all the
> > > > processes have to enter the barrier synchronisation before the next
> > > > superstep starts?
> > >
> > >
> > > Yes, if your data is not distributed evenly or you don't have
> > > homogeneous servers it becomes more costly. It is also driven by
> > > Amdahl's Law, which simply says that your algorithm can only be as
> > > fast as its sequential part. If your algorithm spends 99% of its time
> > > in sequential methods (this is the case in canopy clustering) then you
> > > won't get a large speed boost when parallelizing the remaining 1%.
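The Amdahl's Law bound being invoked here is speedup = 1 / (s + (1 - s) / n), where s is the sequential fraction and n the number of workers. A quick sketch showing why s = 0.99 caps the speedup near 1:

```python
# Amdahl's Law: speedup = 1 / (s + (1 - s) / n), where s is the
# sequential fraction and n the number of workers. With s = 0.99,
# even very many workers give a speedup barely above 1.

def amdahl_speedup(sequential_fraction, workers):
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / workers)
```

With s = 0.99, a thousand workers still yield a speedup of only about 1.01, which is the point being made about canopy clustering.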
> > > And that is the reason why communication with a master task (or at
> > > least an algorithm design with a master task) isn't a good idea either,
> > > because you're then adding sequential parts and another sync. This
> > > makes the whole thing slow.
> > >
> > > But sometimes you can not avoid sequential code and a master task.
> > >
> > > [1] https://github.com/tdunning/knn
> > > [2] http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
> > >
> > > On April 10, 2012 at 12:52, Praveen Sripati <[email protected]> wrote:
> > >
> > > > > It makes sense, since global sync is very expensive.
> > > >
> > > > For any iterative algorithm like k-means, the # of global syncs is
> > > > proportional to the # of iterations. So, how does k-means fit in BSP?
> > > >
> > > > Also, why is global sync very expensive? Is it because all the
> > > > processes have to enter the barrier synchronisation before the next
> > > > superstep starts?
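The pattern the question describes, one global barrier per iteration, can be sketched as follows. This is a single-process illustration of the superstep structure, not the Hama API:

```python
# Sketch of why global syncs track iterations in BSP k-means: each
# iteration ends with one barrier so all tasks see the same updated
# centroids before the next superstep. Single-process illustration
# (1-D points), not the Hama API.

def kmeans_1d(points, centroids, iterations):
    syncs = 0
    for _ in range(iterations):
        # Parallel part: assign points to the nearest centroid and
        # accumulate partial sums (each task would do this locally).
        buckets = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            buckets[nearest].append(p)
        # Global sync point: merge partial results, recompute centroids.
        centroids = [sum(b) / len(b) if b else centroids[i]
                     for i, b in buckets.items()]
        syncs += 1  # one barrier per iteration
    return centroids, syncs
```

The barrier count equals the iteration count, so the sync cost can't be amortized away; it can only be made cheaper per sync, which is the BSP argument.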
> > > >
> > > > Praveen
> > > >
> > > > On Tue, Apr 10, 2012 at 12:30 PM, Thomas Jungblut <
> > > > [email protected]> wrote:
> > > >
> > > > > There are algorithms that have very few supersteps, see the
> > > > > Matrix-Vector Multiplication in GSoC this year.
> > > > > It makes sense, since global sync is very expensive.
> > > > >
> > > > > However, Canopy clustering does not fit very well, since there is a
> > > > > parallel part and a sequential part.
> > > > > So MapReduce is a good fit for canopy clustering.
> > > > >
> > > > > On April 7, 2012 at 15:19, Praveen Sripati <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > After Thomas's implementation of K-Means (3), I was motivated to
> > > > > > extend it using Canopy clustering. So, I started looking at the
> > > > > > MR implementation of Canopy, (1) and (2). The MR implementation
> > > > > > of Canopy clustering is done in two MR phases: the first
> > > > > > identifies the canopies and the second assigns canopies to the
> > > > > > data points. I don't see much improvement when this is done using
> > > > > > BSP. Please correct me if I am wrong.
> > > > > >
> > > > > > Also, are there any algorithms which can be implemented easily
> > > > > > (for those who are getting started with Hama/BSP like me) on
> > > > > > Hama/BSP where we could also see some performance improvements
> > > > > > when compared to the MR implementation? I have seen Mahout, which
> > > > > > has many algorithms implemented in it, and would like to see
> > > > > > something similar in Hama also.
> > > > > >
> > > > > > Thanks,
> > > > > > Praveen
> > > > > >
> > > > > > (1) -
> > > > > > http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
> > > > > > (2) -
> > > > > > https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
> > > > > > (3) -
> > > > > > http://codingwiththomas.blogspot.in/2011/12/k-means-clustering-with-bsp-intuition.html
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thomas Jungblut
> > > > > Berlin <[email protected]>
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thomas Jungblut
> > > Berlin <[email protected]>
> > >
> >
>
