Hi Ted,

Oops, i just realized that the correct url of the paper is this one:
http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
Have a look to the graph comparison at the end of the article, seems that
outperformed streamkm++.

Section 4.3 provides a good discussion against streamkm++. My guess is that
the improveness comes from the approximate nn technique based on random
projections (section 3 of the paper), which simplifies the most time
consuming part.

I don't handle parallelism, since just have a quick implementation of the
algorithm and some more work need to be done.

The idea should be to have an implementation that just come over one pass
through the points, afaik current kmeans implementation iterates till
convergence.

Bests,
Federico

2012/1/15 Ted Dunning <[email protected]>

> This does seem interesting (within the context of a quick skim).  I have a
> few questions, however.
>
> First, how does this compare in practice with k-means++ (which we still
> don't have)?
>
> Secondly, what about parallelism?
>
> Thirdly, would it be better to simply retrofit something like an all-reduce
> operation into our current k-means to avoid map-reduce iterations?
>
> On Sun, Jan 15, 2012 at 9:23 PM, Federico Castanedo <
> [email protected]
> > wrote:
>
> > Hi all,
> >
> > These days i've been looking to this paper:
> > "*Fast and Accurate *k*-means for Large Datasets",* recently presented in
> > NIPS'2011.
> >
> http://web.engr.oregonstate.edu/~shindler/papers/StreamingKMeans_soda11.pdf
> >
> > It seems an outstanding state-of-the-art approach to implement streaming
> > kmeans for very large datasets
> > and my feeling is that could be something really cool to have into
> Mahout.
> >
> > I've just made a quick Java implementation (without M/R capabilities)
> into
> > Mahout trunk code (based on Michael Shindler
> > C++ implementation), but still need more work to do (test that it works
> > correctly, improve some parts and cleaning code).
> > Let me know if you think this method may be something good to have into
> > Mahout. I would like to open a Jira ticket and
> > integrate this new issue with your help if there is enough interest.
> >
> > Bests,
> > Federico
> >
>

Reply via email to