Don't think so. It took me 2 minutes to do the mapping and nearly 18h to reduce. ~80M points in a 30k vector space.

On 10 April 2012 at 16:18, Praveen Sripati <[email protected]> wrote:
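That 2-minute map versus 18-hour reduce is Amdahl's Law in miniature. As a rough back-of-the-envelope illustration (the function name and numbers are my own, not from the thread):

```python
def amdahl_speedup(sequential_fraction, workers):
    """Maximum speedup when only the non-sequential part scales
    across workers (Amdahl's Law)."""
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / workers)

# 99% sequential (the canopy-reducer situation discussed in this
# thread): adding workers barely helps.
print(round(amdahl_speedup(0.99, 100), 3))   # ~1.01x

# Only 1% sequential: near-linear scaling becomes possible.
print(round(amdahl_speedup(0.01, 100), 3))   # ~50.25x with 100 workers
```

The sequential fraction caps the speedup no matter how many workers are added, which is why an almost entirely sequential reduce dominates the runtime.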
> > If your algorithm has 99% of its time in sequential methods (this is
> > the case in canopy clustering) then you won't get a large speed boost
> > when parallelizing the remaining 1%.
>
> Each mapper calculates the canopies for its subset of the input data and
> sends them to a single reducer. The reducer produces the final canopies
> using the thresholds (t1 and t2) and the distance measure. So the canopy
> generation is parallelized in the mappers, and the reducer shouldn't
> take much time either.
>
> (1) http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
> (2) https://cwiki.apache.org/MAHOUT/canopy-clustering.html
>
> Praveen
>
> On Tue, Apr 10, 2012 at 4:56 PM, Thomas Jungblut <
> [email protected]> wrote:
>
> > KMeans is iterative; you can't do anything about it. Syncs are needed,
> > and they are much more costly in MapReduce than in BSP.
> > If you follow the Mahout mailing list, Ted Dunning has found a paper
> > for a k-means that runs in a single pass, so there is no iteration in
> > it. [1][2]
> >
> > > Also, why is global sync very expensive? Is it because all the
> > > processes have to enter the barrier synchronisation before the next
> > > superstep starts?
> >
> > Yes, and if your data is not distributed evenly or you don't have
> > homogeneous servers it becomes more costly. It is also driven by
> > Amdahl's Law, which simply says that your algorithm can only be as
> > fast as its sequential part. If your algorithm has 99% of its time in
> > sequential methods (this is the case in canopy clustering) then you
> > won't get a large speed boost when parallelizing the remaining 1%.
> > That is also the reason why communication with a master task (or at
> > least an algorithm design with a master task) isn't a good idea
> > either: you're then adding sequential parts and another sync, which
> > makes the whole thing slow.
> >
> > But sometimes you cannot avoid sequential code and a master task.
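For readers new to canopy clustering, the t1/t2 rule mentioned above can be sketched as a single-machine loop. This is my own illustration of the technique, not Mahout's code, and the function names are made up; it corresponds to the sequential step the single reducer performs:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2, distance=euclidean):
    """Sequential sketch of canopy clustering (assumes t2 < t1).

    Points within t1 of a chosen center join its canopy; points
    within t2 are removed from the candidate pool so they cannot
    become centers themselves.
    """
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)  # pick an arbitrary remaining point
        members = [p for p in points if distance(center, p) < t1]
        canopies.append((center, members))
        # Drop everything strongly covered by this canopy.
        candidates = [p for p in candidates if distance(center, p) >= t2]
    return canopies
```

For example, `canopy([(0.0,), (0.5,), (10.0,), (10.4,)], t1=3.0, t2=1.0)` yields two canopies, one around each group of nearby points. Note that the loop over candidates is inherently order-dependent and sequential, which is the sequential part being discussed.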
> > [1] https://github.com/tdunning/knn
> > [2] http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
> >
> > On 10 April 2012 at 12:52, Praveen Sripati <[email protected]> wrote:
> >
> > > It makes sense, since global sync is very expensive.
> > >
> > > For any iterative algorithm like k-means, the # of global syncs is
> > > proportional to the # of iterations. So, how does k-means fit in BSP?
> > >
> > > Also, why is global sync very expensive? Is it because all the
> > > processes have to enter the barrier synchronisation before the next
> > > superstep starts?
> > >
> > > Praveen
> > >
> > > On Tue, Apr 10, 2012 at 12:30 PM, Thomas Jungblut <
> > > [email protected]> wrote:
> > >
> > > > There are algorithms that have very few supersteps; see the
> > > > Matrix-Vector Multiplication in GSoC this year.
> > > > It makes sense, since global sync is very expensive.
> > > >
> > > > However, canopy clustering does not fit very well, since there is
> > > > a parallel part and a sequential part.
> > > > So MapReduce is a good fit for canopy clustering.
> > > >
> > > > On 7 April 2012 at 15:19, Praveen Sripati <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > After Thomas's implementation of K-Means (3) I was motivated to
> > > > > extend it using canopy clustering. So, I started looking at the
> > > > > MR implementation of Canopy, (1) and (2). The MR implementation
> > > > > of canopy clustering is done in two MR phases: the first one to
> > > > > identify the canopies, and the second to assign canopies to the
> > > > > data points. I don't see much improvement when this is done
> > > > > using BSP. Please correct me if I am wrong.
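The point about the number of global syncs being proportional to the number of iterations can be made concrete with a plain, single-machine Lloyd's k-means loop. This is a sketch of the general algorithm, not Hama code; the comments mark where the barrier would sit in a BSP version:

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain (single-machine) Lloyd's k-means over tuples.

    In a BSP version, each pass of the outer loop is one superstep:
    every task assigns its local points, sends partial sums, and then
    enters the global barrier before the new centers are visible --
    so the number of syncs grows with the number of iterations.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: parallelizable across tasks.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Update step: in BSP this happens only after the barrier sync.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(dim) / len(members)
                                         for dim in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        centers = new_centers
    return centers
```

For instance, `kmeans([(0.0,), (1.0,), (10.0,), (11.0,)], k=2)` converges to centers at 0.5 and 10.5. The fixed iteration count here stands in for a convergence check; either way, every iteration costs one global sync in BSP (or a full job launch per iteration in MapReduce, which is why the syncs are so much cheaper in BSP).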
> > > > > Also, are there any algorithms which can be implemented easily
> > > > > (for those who are getting started with Hama/BSP, like me) on
> > > > > Hama/BSP, where we could also see some performance improvements
> > > > > compared to the MR implementation? I have seen Mahout, and there
> > > > > are many algorithms implemented in it; I would like to see
> > > > > something similar in Hama as well.
> > > > >
> > > > > Thanks,
> > > > > Praveen
> > > > >
> > > > > (1) http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
> > > > > (2) https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
> > > > > (3) http://codingwiththomas.blogspot.in/2011/12/k-means-clustering-with-bsp-intuition.html
> > > >
> > > > --
> > > > Thomas Jungblut
> > > > Berlin <[email protected]>
> >
> > --
> > Thomas Jungblut
> > Berlin <[email protected]>
