The use of the term facilities is very confusing. I would prefer we talk about centroids or the Cluster data structure.
Essentially what fastKM does is to pre-cluster the data into a large set of centroids. This is plausibly done in parallel. Then fastKM does conventional k-means on this large set. One particular virtue is that a probabilistic method is used for adding new centroids. This is similar to, but simpler than, k-means++ and shares some of its virtues.

On Mon, Jan 16, 2012 at 3:02 PM, Federico Castanedo <[email protected]> wrote:

> Ted, thanks for your comments.
>
> One difference that I see between this technique (fastkm) and the current
> kmeans clustering implementation in Mahout: at the end, fastkm provides the
> set of K points based on the cost minimization of the k employed facilities
> (with k >> K), but the current o.a.m.c.kmeans provides a List of Cluster
> objects, that is, the coordinates of the K prototypes and the points that
> belong to them. fastkm is conceived to "train" only the prototypes of the
> clusters (forgetting the employed d-dimensional points), which of course
> could be used later as a classifier for new points (the nice relation
> between clustering and classification).
>
> So I don't know how this different behaviour of clustering (batch vs
> streaming) could be integrated into a framework such as Mahout from the
> user's perspective. For some applications it may be necessary to have the
> points clustered along with the prototypes, but for others having just the
> sketch works fine.
>
> 2012/1/15 Ted Dunning <[email protected]>
>
> > Yes. This does seem interesting.
> >
> > One particular point is that the extraction of the sketch of the data set
> > is an operation that is likely parallelizable, and the sketch itself is
> > probably interesting for many other algorithms. For instance, the sketch
> > may be usable as a surrogate for the entire data set for knn algorithms.
> > My idea that the sketch production is amenable to parallel treatment
> > stems from the observation that the union of sketches of parts of the
> > data is probably a decent sketch. Parallel implementation would be
> > trivial if this is true.
> >
> > The performance relative to streamkm++ is not as clear as it might seem
> > if you look at the additional evaluations at the end of the paper. It
> > looks like fastkmeans and streamkm++ each win sometimes and lose
> > sometimes relative to each other. I don't see a clear win on either side.
> >
> > On Sun, Jan 15, 2012 at 9:59 PM, Federico Castanedo <[email protected]> wrote:
> >
> > > Hi Ted,
> > >
> > > Oops, I just realized that the correct url of the paper is this one:
> > > http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
> > > Have a look at the graph comparison at the end of the article; it seems
> > > that it outperformed streamkm++.
> > >
> > > Section 4.3 provides a good discussion against streamkm++. My guess is
> > > that the improvement comes from the approximate nn technique based on
> > > random projections (section 3 of the paper), which simplifies the most
> > > time-consuming part.
> > >
> > > I don't handle parallelism yet, since I just have a quick implementation
> > > of the algorithm and some more work needs to be done.
> > >
> > > The idea would be to have an implementation that makes just one pass
> > > through the points; afaik the current kmeans implementation iterates
> > > till convergence.
> > >
> > > Bests,
> > > Federico
> > >
> > > 2012/1/15 Ted Dunning <[email protected]>
> > >
> > > > This does seem interesting (within the context of a quick skim). I
> > > > have a few questions, however.
> > > >
> > > > First, how does this compare in practice with k-means++ (which we
> > > > still don't have)?
> > > >
> > > > Secondly, what about parallelism?
> > > >
> > > > Thirdly, would it be better to simply retrofit something like an
> > > > all-reduce operation into our current k-means to avoid map-reduce
> > > > iterations?
> > > >
> > > > On Sun, Jan 15, 2012 at 9:23 PM, Federico Castanedo <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > These days I've been looking at this paper: "Fast and Accurate
> > > > > k-means for Large Datasets", recently presented at NIPS 2011.
> > > > >
> > > > > http://web.engr.oregonstate.edu/~shindler/papers/StreamingKMeans_soda11.pdf
> > > > >
> > > > > It seems an outstanding state-of-the-art approach to implementing
> > > > > streaming kmeans for very large datasets, and my feeling is that it
> > > > > could be something really cool to have in Mahout.
> > > > >
> > > > > I've just made a quick Java implementation (without M/R
> > > > > capabilities) in the Mahout trunk code (based on Michael Shindler's
> > > > > C++ implementation), but it still needs more work (testing that it
> > > > > works correctly, improving some parts, and cleaning up the code).
> > > > >
> > > > > Let me know if you think this method may be something good to have
> > > > > in Mahout. I would like to open a Jira ticket and integrate it with
> > > > > your help if there is enough interest.
> > > > >
> > > > > Bests,
> > > > > Federico
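For readers following along, the probabilistic centroid-addition step discussed above (open a new centroid with probability proportional to a point's squared distance to its nearest existing centroid, capped at 1) and the trivial sketch union can be sketched in a few lines of Java. This is a minimal illustration under invented names (StreamingSketch, facilityCost, merge), not the Mahout or Shindler code; weight tracking and the re-clustering/cost-doubling step a full implementation needs are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy single-pass sketch builder; names are invented for this example.
class StreamingSketch {
  private final List<double[]> centroids = new ArrayList<>();
  private final Random rng = new Random(42);
  private final double facilityCost; // cost threshold for opening a new centroid

  StreamingSketch(double facilityCost) {
    this.facilityCost = facilityCost;
  }

  /** Process one point of the stream in a single pass. */
  void add(double[] point) {
    if (centroids.isEmpty()) {
      centroids.add(point.clone());
      return;
    }
    double d2 = distSqToNearest(point);
    // Open a new centroid with probability min(d2 / facilityCost, 1):
    // far-away points tend to become centroids, echoing k-means++ seeding.
    if (rng.nextDouble() < Math.min(d2 / facilityCost, 1.0)) {
      centroids.add(point.clone());
    }
    // A real implementation would otherwise assign the point's weight to
    // the nearest centroid, and re-cluster the centroids themselves when
    // there are too many of them -- omitted here.
  }

  private double distSqToNearest(double[] p) {
    double best = Double.POSITIVE_INFINITY;
    for (double[] c : centroids) {
      double s = 0;
      for (int i = 0; i < p.length; i++) {
        double diff = p[i] - c[i];
        s += diff * diff;
      }
      best = Math.min(best, s);
    }
    return best;
  }

  /** Union of two sketches: pool the centroids; the pooled set can then be
   *  clustered with conventional k-means, which is why a parallel
   *  map-side sketch followed by a reduce-side union is plausible. */
  static List<double[]> merge(StreamingSketch a, StreamingSketch b) {
    List<double[]> merged = new ArrayList<>(a.centroids);
    merged.addAll(b.centroids);
    return merged;
  }

  int size() {
    return centroids.size();
  }
}
```

With facilityCost = 1, a duplicate point (distance 0) is never promoted to a centroid, while a very distant point always is, so the sketch stays much smaller than the stream while covering it.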
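Federico's guess that the speedup comes from the random-projection approximate-nn technique (section 3 of the paper) can also be illustrated. The toy below projects stored points onto one random unit direction and, for a query, checks only the few points whose projections bracket the query's projection instead of scanning everything. Names (ProjectionSearch, searchWidth) are invented for this example; the paper's actual scheme is more elaborate, and a real implementation would use several projections and handle projection ties (a TreeMap keeps one point per projection value here).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

// Toy approximate nearest-neighbour search via a single random projection.
class ProjectionSearch {
  private final double[] direction; // random unit vector
  private final TreeMap<Double, double[]> byProjection = new TreeMap<>();

  ProjectionSearch(int dim, long seed) {
    Random rng = new Random(seed);
    direction = new double[dim];
    double norm = 0;
    for (int i = 0; i < dim; i++) {
      direction[i] = rng.nextGaussian();
      norm += direction[i] * direction[i];
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < dim; i++) {
      direction[i] /= norm;
    }
  }

  private double project(double[] v) {
    double s = 0;
    for (int i = 0; i < v.length; i++) {
      s += v[i] * direction[i];
    }
    return s;
  }

  void add(double[] v) {
    byProjection.put(project(v), v);
  }

  /** Check true distances only for the searchWidth stored points on each
   *  side of the query's projection. */
  double[] nearest(double[] q, int searchWidth) {
    double p = project(q);
    List<double[]> candidates = new ArrayList<>();
    Iterator<double[]> down =
        byProjection.headMap(p, true).descendingMap().values().iterator();
    for (int i = 0; i < searchWidth && down.hasNext(); i++) {
      candidates.add(down.next());
    }
    Iterator<double[]> up = byProjection.tailMap(p, false).values().iterator();
    for (int i = 0; i < searchWidth && up.hasNext(); i++) {
      candidates.add(up.next());
    }
    double[] best = null;
    double bestDist = Double.POSITIVE_INFINITY;
    for (double[] c : candidates) {
      double d = 0;
      for (int i = 0; i < q.length; i++) {
        double t = q[i] - c[i];
        d += t * t;
      }
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }
}
```

The point of the trick is that the inner loop of k-means (find the nearest centroid) drops from a full scan to a small constant number of distance computations per point, which is plausibly where the win over streamkm++ comes from.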
