This does seem interesting (within the context of a quick skim). I have a few questions, however.
First, how does this compare in practice with k-means++ (which we still don't have)? Secondly, what about parallelism? Thirdly, would it be better to simply retrofit something like an all-reduce operation into our current k-means to avoid map-reduce iterations?

On Sun, Jan 15, 2012 at 9:23 PM, Federico Castanedo <[email protected]> wrote:

> Hi all,
>
> These days i've been looking to this paper:
> "Fast and Accurate k-means for Large Datasets", recently presented in
> NIPS'2011.
> http://web.engr.oregonstate.edu/~shindler/papers/StreamingKMeans_soda11.pdf
>
> It seems an outstanding state-of-the-art approach to implement streaming
> kmeans for very large datasets
> and my feeling is that could be something really cool to have into Mahout.
>
> I've just made a quick Java implementation (without M/R capabilities) into
> Mahout trunk code (based on Michael Shindler
> C++ implementation), but still need more work to do (test that it works
> correctly, improve some parts and cleaning code).
> Let me know if you think this method may be something good to have into
> Mahout. I would like to open a Jira ticket and
> integrate this new issue with your help if there is enough interest.
>
> Bests,
> Federico
