Hi Ted, Oops, i just realized that the correct url of the paper is this one: http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf Have a look to the graph comparison at the end of the article, seems that outperformed streamkm++.
Section 4.3 provides a good discussion against streamkm++. My guess is that the improveness comes from the approximate nn technique based on random projections (section 3 of the paper), which simplifies the most time consuming part. I don't handle parallelism, since just have a quick implementation of the algorithm and some more work need to be done. The idea should be to have an implementation that just come over one pass through the points, afaik current kmeans implementation iterates till convergence. Bests, Federico 2012/1/15 Ted Dunning <[email protected]> > This does seem interesting (within the context of a quick skim). I have a > few questions, however. > > First, how does this compare in practice with k-means++ (which we still > don't have)? > > Secondly, what about parallelism? > > Thirdly, would it be better to simply retrofit something like an all-reduce > operation into our current k-means to avoid map-reduce iterations? > > On Sun, Jan 15, 2012 at 9:23 PM, Federico Castanedo < > [email protected] > > wrote: > > > Hi all, > > > > These days i've been looking to this paper: > > "*Fast and Accurate *k*-means for Large Datasets",* recently presented in > > NIPS'2011. > > > http://web.engr.oregonstate.edu/~shindler/papers/StreamingKMeans_soda11.pdf > > > > It seems an outstanding state-of-the-art approach to implement streaming > > kmeans for very large datasets > > and my feeling is that could be something really cool to have into > Mahout. > > > > I've just made a quick Java implementation (without M/R capabilities) > into > > Mahout trunk code (based on Michael Shindler > > C++ implementation), but still need more work to do (test that it works > > correctly, improve some parts and cleaning code). > > Let me know if you think this method may be something good to have into > > Mahout. I would like to open a Jira ticket and > > integrate this new issue with your help if there is enough interest. > > > > Bests, > > Federico > > >
