Yes.  This does seem interesting.

One particular point is that extracting a sketch of the data set is an
operation that is likely parallelizable, and the sketch itself is probably
interesting for many other algorithms.  For instance, the sketch may be
usable as a surrogate for the entire data set in k-NN algorithms.  My idea
that sketch production is amenable to parallel treatment stems from the
observation that the union of sketches of parts of the data is probably
itself a decent sketch.  If this is true, a parallel implementation would
be trivial.
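To make the merge idea concrete, here is a minimal sketch of what that union could look like. The types (WeightedPoint, Sketch) are invented for illustration and are not from Mahout or the paper; the assumption being tested is exactly the one above, that a sketch can be represented as weighted points and that concatenating per-partition sketches yields a usable sketch of the whole.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; not Mahout or paper code.
class WeightedPoint {
    final double[] coords;
    final double weight;

    WeightedPoint(double[] coords, double weight) {
        this.coords = coords;
        this.weight = weight;
    }
}

class Sketch {
    final List<WeightedPoint> points = new ArrayList<>();

    // If the union of sketches of parts of the data is itself a decent
    // sketch, merging is trivial: concatenate the weighted points.
    // (One would presumably re-cluster afterwards to bound sketch size.)
    static Sketch union(Sketch a, Sketch b) {
        Sketch merged = new Sketch();
        merged.points.addAll(a.points);
        merged.points.addAll(b.points);
        return merged;
    }
}
```

With that property, each mapper could sketch its own split independently and a reducer could union the results, which is the trivially parallel structure described above.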

The performance relative to streamkm++ is not as clear as it might seem if
you look at the additional evaluations at the end of the paper.  It looks
like fastkmeans and streamkm++ each sometimes win and sometimes lose
relative to each other; I don't see a clear win on either side.

On Sun, Jan 15, 2012 at 9:59 PM, Federico Castanedo <[email protected]> wrote:

> Hi Ted,
>
> Oops, I just realized that the correct URL of the paper is this one:
> http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
> Have a look at the graph comparison at the end of the article; it seems
> that it outperforms streamkm++.
>
> Section 4.3 provides a good discussion against streamkm++. My guess is that
> the improvement comes from the approximate NN technique based on random
> projections (Section 3 of the paper), which simplifies the most
> time-consuming part.
>
> I don't handle parallelism yet, since I just have a quick implementation of
> the algorithm and some more work needs to be done.
>
> The idea should be to have an implementation that makes just one pass
> through the points; AFAIK the current k-means implementation iterates until
> convergence.
>
> Bests,
> Federico
>
> 2012/1/15 Ted Dunning <[email protected]>
>
> > This does seem interesting (within the context of a quick skim).  I have a
> > few questions, however.
> >
> > First, how does this compare in practice with k-means++ (which we still
> > don't have)?
> >
> > Secondly, what about parallelism?
> >
> > Thirdly, would it be better to simply retrofit something like an all-reduce
> > operation into our current k-means to avoid map-reduce iterations?
> >
> > On Sun, Jan 15, 2012 at 9:23 PM, Federico Castanedo <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > These days I've been looking at this paper:
> > > "Fast and Accurate k-means for Large Datasets", recently presented at
> > > NIPS 2011.
> > >
> > > http://web.engr.oregonstate.edu/~shindler/papers/StreamingKMeans_soda11.pdf
> > >
> > > It seems an outstanding state-of-the-art approach to implementing
> > > streaming k-means for very large datasets, and my feeling is that it
> > > could be something really cool to have in Mahout.
> > >
> > > I've just made a quick Java implementation (without M/R capabilities) in
> > > the Mahout trunk code (based on Michael Shindler's C++ implementation),
> > > but there is still more work to do (testing that it works correctly,
> > > improving some parts, and cleaning up the code).
> > > Let me know if you think this method would be something good to have in
> > > Mahout. I would like to open a Jira ticket and integrate this new feature
> > > with your help if there is enough interest.
> > >
> > > Bests,
> > > Federico
> > >
> >
>
