The use of the term facilities is very confusing. I would prefer we talk about centroids or the Cluster data structure.
Essentially what fastKM does is to pre-cluster the data into a large set of centroids. This is plausibly done in parallel. Then fastKM does conventional k-means on this large set. One particular virtue is that a probabilistic method is used for adding new centroids. This is similar to, but simpler than, k-means++ and shares some of its virtues.

On Mon, Jan 16, 2012 at 3:02 PM, Federico Castanedo <[email protected]> wrote:

> Ted, thanks for your comments.
>
> One difference that I see between this technique (fastkm) and the current
> kmeans clustering implementation in Mahout: at the end, fastkm provides the
> set of K points based on the cost minimization of the k employed facilities
> (with k >> K), but the current o.a.m.c.kmeans provides a List of Cluster
> objects, that is, the coordinates of the K prototypes and the points that
> belong to them. fastkm is conceived to "train" only the prototypes of the
> clusters (forgetting the employed d-dimensional points), which of course
> could be used later as a classifier for new points (the nice relation
> between clustering and classification).
>
> So I don't know how this different behaviour of clustering (batch vs
> streaming) could be integrated into a framework such as Mahout from the
> user's perspective. For some applications it may be necessary to have the
> points clustered along with the prototypes, but for others having just the
> sketch works fine.
>
> 2012/1/15 Ted Dunning <[email protected]>
>
> > Yes. This does seem interesting.
> >
> > One particular point is that the extraction of the sketch of the data set
> > is an operation that is likely parallelizable, and the sketch itself is
> > probably interesting for many other algorithms. For instance, the sketch
> > may be usable as a surrogate for the entire data set for knn algorithms.
> > My idea that the sketch production is amenable to parallel treatment
> > stems from the observation that the union of sketches of parts of the
> > data is probably a decent sketch. Parallel implementation would be
> > trivial if this is true.
> >
> > The performance relative to streamkm++ is not as clear as it might seem
> > if you look at the additional evaluations at the end of the paper. It
> > looks like fastkmeans and streamkm++ each win sometimes and lose
> > sometimes relative to each other. I don't see a clear win on either side.
> >
> > On Sun, Jan 15, 2012 at 9:59 PM, Federico Castanedo <[email protected]> wrote:
> >
> > > Hi Ted,
> > >
> > > Oops, I just realized that the correct url of the paper is this one:
> > > http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
> > > Have a look at the graph comparison at the end of the article; it seems
> > > that it outperformed streamkm++.
> > >
> > > Section 4.3 provides a good discussion against streamkm++. My guess is
> > > that the improvement comes from the approximate nn technique based on
> > > random projections (section 3 of the paper), which simplifies the most
> > > time-consuming part.
> > >
> > > I don't handle parallelism yet, since I just have a quick implementation
> > > of the algorithm and some more work needs to be done.
> > >
> > > The idea would be to have an implementation that makes just one pass
> > > through the points; afaik the current kmeans implementation iterates
> > > till convergence.
> > >
> > > Bests,
> > > Federico
> > >
> > > 2012/1/15 Ted Dunning <[email protected]>
> > >
> > > > This does seem interesting (within the context of a quick skim). I
> > > > have a few questions, however.
> > > >
> > > > First, how does this compare in practice with k-means++ (which we
> > > > still don't have)?
> > > >
> > > > Secondly, what about parallelism?
> > > >
> > > > Thirdly, would it be better to simply retrofit something like an
> > > > all-reduce operation into our current k-means to avoid map-reduce
> > > > iterations?
> > > >
> > > > On Sun, Jan 15, 2012 at 9:23 PM, Federico Castanedo <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > These days I've been looking at this paper: "Fast and Accurate
> > > > > k-means for Large Datasets", recently presented at NIPS 2011.
> > > > >
> > > > > http://web.engr.oregonstate.edu/~shindler/papers/StreamingKMeans_soda11.pdf
> > > > >
> > > > > It seems an outstanding state-of-the-art approach to implementing
> > > > > streaming kmeans for very large datasets, and my feeling is that it
> > > > > could be something really cool to have in Mahout.
> > > > >
> > > > > I've just made a quick Java implementation (without M/R
> > > > > capabilities) in the Mahout trunk code (based on Michael Shindler's
> > > > > C++ implementation), but it still needs more work (testing that it
> > > > > works correctly, improving some parts, and cleaning up the code).
> > > > >
> > > > > Let me know if you think this method may be something good to have
> > > > > in Mahout. I would like to open a Jira ticket and integrate it with
> > > > > your help if there is enough interest.
> > > > >
> > > > > Bests,
> > > > > Federico
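For readers following along, the probabilistic centroid-addition step discussed above (open a new centroid with probability proportional to a point's squared distance to its nearest existing centroid, capped at 1) and the trivial sketch union can be sketched in a few lines of Java. This is a minimal illustration under invented names (StreamingSketch, facilityCost, merge), not the Mahout or Shindler code; weight tracking and the re-clustering/cost-doubling step a full implementation needs are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy single-pass sketch builder; names are invented for this example.
class StreamingSketch {
  private final List<double[]> centroids = new ArrayList<>();
  private final Random rng = new Random(42);
  private final double facilityCost; // cost threshold for opening a new centroid

  StreamingSketch(double facilityCost) {
    this.facilityCost = facilityCost;
  }

  /** Process one point of the stream in a single pass. */
  void add(double[] point) {
    if (centroids.isEmpty()) {
      centroids.add(point.clone());
      return;
    }
    double d2 = distSqToNearest(point);
    // Open a new centroid with probability min(d2 / facilityCost, 1):
    // far-away points tend to become centroids, echoing k-means++ seeding.
    if (rng.nextDouble() < Math.min(d2 / facilityCost, 1.0)) {
      centroids.add(point.clone());
    }
    // A real implementation would otherwise assign the point's weight to
    // the nearest centroid, and re-cluster the centroids themselves when
    // there are too many of them -- omitted here.
  }

  private double distSqToNearest(double[] p) {
    double best = Double.POSITIVE_INFINITY;
    for (double[] c : centroids) {
      double s = 0;
      for (int i = 0; i < p.length; i++) {
        double diff = p[i] - c[i];
        s += diff * diff;
      }
      best = Math.min(best, s);
    }
    return best;
  }

  /** Union of two sketches: pool the centroids; the pooled set can then be
   *  clustered with conventional k-means, which is why a parallel
   *  map-side sketch followed by a reduce-side union is plausible. */
  static List<double[]> merge(StreamingSketch a, StreamingSketch b) {
    List<double[]> merged = new ArrayList<>(a.centroids);
    merged.addAll(b.centroids);
    return merged;
  }

  int size() {
    return centroids.size();
  }
}
```

With facilityCost = 1, a duplicate point (distance 0) is never promoted to a centroid, while a very distant point always is, so the sketch stays much smaller than the stream while covering it.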
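Federico's guess that the speedup comes from the random-projection approximate-nn technique (section 3 of the paper) can also be illustrated. The toy below projects stored points onto one random unit direction and, for a query, checks only the few points whose projections bracket the query's projection instead of scanning everything. Names (ProjectionSearch, searchWidth) are invented for this example; the paper's actual scheme is more elaborate, and a real implementation would use several projections and handle projection ties (a TreeMap keeps one point per projection value here).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

// Toy approximate nearest-neighbour search via a single random projection.
class ProjectionSearch {
  private final double[] direction; // random unit vector
  private final TreeMap<Double, double[]> byProjection = new TreeMap<>();

  ProjectionSearch(int dim, long seed) {
    Random rng = new Random(seed);
    direction = new double[dim];
    double norm = 0;
    for (int i = 0; i < dim; i++) {
      direction[i] = rng.nextGaussian();
      norm += direction[i] * direction[i];
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < dim; i++) {
      direction[i] /= norm;
    }
  }

  private double project(double[] v) {
    double s = 0;
    for (int i = 0; i < v.length; i++) {
      s += v[i] * direction[i];
    }
    return s;
  }

  void add(double[] v) {
    byProjection.put(project(v), v);
  }

  /** Check true distances only for the searchWidth stored points on each
   *  side of the query's projection. */
  double[] nearest(double[] q, int searchWidth) {
    double p = project(q);
    List<double[]> candidates = new ArrayList<>();
    Iterator<double[]> down =
        byProjection.headMap(p, true).descendingMap().values().iterator();
    for (int i = 0; i < searchWidth && down.hasNext(); i++) {
      candidates.add(down.next());
    }
    Iterator<double[]> up = byProjection.tailMap(p, false).values().iterator();
    for (int i = 0; i < searchWidth && up.hasNext(); i++) {
      candidates.add(up.next());
    }
    double[] best = null;
    double bestDist = Double.POSITIVE_INFINITY;
    for (double[] c : candidates) {
      double d = 0;
      for (int i = 0; i < q.length; i++) {
        double t = q[i] - c[i];
        d += t * t;
      }
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }
}
```

The point of the trick is that the inner loop of k-means (find the nearest centroid) drops from a full scan to a small constant number of distance computations per point, which is plausibly where the win over streamkm++ comes from.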
