> If your algorithm spends 99% of its time in sequential methods (this is the
> case in canopy clustering) then you won't get a large speed boost when
> parallelizing the remaining 1%.
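Amdahl's Law makes this concrete. A minimal sketch (the 99%/1% split comes from the quote above; the worker counts are just illustrative):

```python
# Amdahl's Law: speedup(n) = 1 / (s + (1 - s) / n),
# where s is the sequential fraction and n is the number of workers.
def speedup(sequential_fraction, workers):
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / workers)

# With 99% sequential time (as in canopy clustering), even very many
# workers cap the overall speedup barely above 1x.
for n in (1, 10, 100, 1000):
    print(n, round(speedup(0.99, n), 4))
```

With s = 0.99 the speedup never exceeds 1/0.99 ≈ 1.0101, no matter how many workers you add.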

Each mapper calculates the canopies for its subset of the input data and
sends them to a single reducer. The reducer produces the final canopies
using the thresholds (t1 and t2) and the distance measure. So the canopy
generation is parallelized across the mappers, and the reducer shouldn't
take much time either.
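As a rough sketch of what each mapper (and then the single reducer) does, here is the classic single-pass canopy algorithm in Python. The data points, the Euclidean distance, and the threshold values are illustrative assumptions, not taken from the Mahout code:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2, distance=euclidean):
    """Single-pass canopy clustering; requires t1 > t2.

    Each point within t1 of a canopy center joins that canopy; points
    within the tighter threshold t2 are removed from further
    consideration and can never seed a new canopy.
    """
    assert t1 > t2
    remaining = list(points)
    canopies = []  # list of (center, members)
    while remaining:
        center = remaining.pop(0)          # pick an arbitrary remaining point
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # may still seed/join other canopies
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

# Each mapper would run canopy() on its input split and emit only the
# centers; the single reducer then runs canopy() again over all mapper
# centers with the same t1/t2 to produce the final canopies.
mapper_centers = [c for c, _ in canopy([(0, 0), (1, 0), (10, 10), (11, 10)],
                                       t1=3.0, t2=1.5)]
final = canopy(mapper_centers, t1=3.0, t2=1.5)
```

Since the reducer only sees one center per canopy per mapper, its input is tiny compared to the full data set, which is why the reduce side stays cheap.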

(1) http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
(2) https://cwiki.apache.org/MAHOUT/canopy-clustering.html

Praveen

On Tue, Apr 10, 2012 at 4:56 PM, Thomas Jungblut <[email protected]> wrote:

> KMeans is iterative; you can't do anything about that. Syncs are needed, and
> they are much more costly in MapReduce than in BSP.
> If you follow the Mahout mailing list, Ted Dunning has found a paper on a
> kmeans variant that runs in a single pass, so there is no iteration in
> there. [1][2]
>
> > Also, why is global sync very expensive? Is it because all the processes
> > have to enter the barrier synchronisation before the next superstep
> > starts?
>
>
> Yes, and if your data is not distributed evenly or you don't have
> homogeneous servers it becomes more costly. It is also governed by Amdahl's
> Law, which says that a parallel algorithm can only be as fast as its
> sequential part allows. If your algorithm spends 99% of its time in
> sequential methods (this is the case in canopy clustering), then you won't
> get a large speed boost from parallelizing the remaining 1%.
> That is also why communication with a master task (or at least an algorithm
> design built around a master task) isn't a good idea: you're adding
> sequential parts and another sync, which makes the whole thing slow.
>
> But sometimes you cannot avoid sequential code and a master task.
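The straggler effect Thomas describes (uneven data at a barrier) can be illustrated with a small simulation; the per-worker item counts below are made up:

```python
# Barrier sync: a superstep ends only when the slowest worker arrives,
# so the superstep takes as long as the largest share of the work.
def superstep_time(work_per_worker, unit_cost=1.0):
    return max(work_per_worker) * unit_cost

even = [250, 250, 250, 250]    # 1000 items, evenly split
skewed = [700, 100, 100, 100]  # same 1000 items, badly split

print(superstep_time(even))    # 250.0
print(superstep_time(skewed))  # 700.0 -- everyone waits for the straggler
```

The total work is identical in both cases; only the skew makes the barrier expensive, and in an iterative algorithm that cost is paid once per superstep.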
>
> [1]  https://github.com/tdunning/knn
> [2] http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
>
> On 10 April 2012 at 12:52, Praveen Sripati <[email protected]> wrote:
>
> > > It makes sense, since global sync is very expensive.
> >
> > For any iterative algorithm like k-means, the # of global syncs is
> > proportional to the # of iterations. So, how does k-means fit in BSP?
> >
> > Also, why is global sync very expensive? Is it because all the processes
> > have to enter the barrier synchronisation before the next superstep
> > starts?
> >
> > Praveen
> >
> > On Tue, Apr 10, 2012 at 12:30 PM, Thomas Jungblut <[email protected]> wrote:
> >
> > > There are algorithms that have very few supersteps; see the
> > > Matrix-Vector Multiplication in GSoC this year.
> > > It makes sense, since global sync is very expensive.
> > >
> > > However, Canopy clustering does not fit BSP very well, since there is a
> > > parallel part and a sequential part.
> > > So MapReduce is a good fit for canopy clustering.
> > >
> > > On 7 April 2012 at 15:19, Praveen Sripati <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > After Thomas's implementation of K-Means (3), I was motivated to
> > > > extend it using Canopy clustering. So, I started looking at the MR
> > > > implementations of Canopy, (1) and (2). The MR implementation of
> > > > Canopy clustering is done in two MR phases: the first to identify the
> > > > canopies and the second to assign the data points to canopies. I
> > > > don't see much improvement when this is done using BSP. Please
> > > > correct me if I am wrong.
> > > >
> > > > Also, are there any algorithms that can be implemented easily on
> > > > Hama/BSP (for those who, like me, are getting started with Hama/BSP)
> > > > where we could also see some performance improvement compared to the
> > > > MR implementation? Mahout has many algorithms implemented, and I
> > > > would like to see something similar in Hama.
> > > >
> > > > Thanks,
> > > > Praveen
> > > >
> > > > (1) - http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
> > > > (2) - https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
> > > > (3) - http://codingwiththomas.blogspot.in/2011/12/k-means-clustering-with-bsp-intuition.html
> > > >
> > >
> > >
> > >
> > > --
> > > Thomas Jungblut
> > > Berlin <[email protected]>
> > >
> >
>
>
>
> --
> Thomas Jungblut
> Berlin <[email protected]>
>