Pipe-pipe (k-means||) is an initialization algorithm which (roughly)
parallelizes something like the ++ algorithm and requires only
logarithmically many parallel passes over the data, as opposed to the k
passes required by ++.
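
Roughly, the || oversampling step looks something like this (just a toy
in-memory sketch to show the shape of the algorithm, not the MLlib code;
the oversampling factor l and the fixed round count are my own
simplifications):

import scala.util.Random

// Toy in-memory illustration of the k-means|| oversampling idea.  In a real
// implementation each round would be one parallel pass over distributed data.
def kmeansParallelInit(data: IndexedSeq[Array[Double]], l: Double,
                       rounds: Int, rng: Random): Seq[Array[Double]] = {
  def dist2(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  var centers = Seq(data(rng.nextInt(data.size)))   // one random seed point
  for (_ <- 1 to rounds) {                          // ~log k rounds, not k
    val costs = data.map(x => centers.map(c => dist2(x, c)).min)
    val total = costs.sum
    // oversample: each point joins the candidate set with probability
    // proportional to its squared distance to the current candidates
    val sampled = data.zip(costs).collect {
      case (x, cost) if rng.nextDouble() < l * cost / total => x
    }
    centers = centers ++ sampled
  }
  // the candidates are then weighted by how many points they attract and
  // reduced to k centers with a local k-means++ run (omitted here)
  centers
}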

DP-means itself is a variant of k-means where the value of k is not fixed.
Instead, whenever a data point is further than a threshold \lambda from
the nearest centroid, it is used to form a new centroid.  Different choices
of \lambda result in different numbers of clusters.  I don't know how well
DP-means by itself would work as an initialization pass.
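
For concreteness, a minimal single-pass version of that rule could look
roughly like the following (a hedged sketch assuming Euclidean distance and
a running weighted-mean update; the Centroid class and the dpMeansPass name
are mine, not anything in Mahout or MLlib):

import scala.collection.mutable.ArrayBuffer

// A point further than lambda from every existing centroid starts a new
// centroid; otherwise it is folded into the nearest one.
case class Centroid(var mean: Array[Double], var weight: Double)

def dpMeansPass(data: Iterator[Array[Double]], lambda: Double): Seq[Centroid] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val centroids = ArrayBuffer[Centroid]()
  for (x <- data) {
    val nearest = if (centroids.isEmpty) None
                  else Some(centroids.minBy(c => dist(c.mean, x)))
    nearest.filter(c => dist(c.mean, x) <= lambda) match {
      case Some(c) =>
        // fold the point into the nearest centroid as a running weighted mean
        c.weight += 1.0
        c.mean = c.mean.zip(x).map { case (m, v) => m + (v - m) / c.weight }
      case None =>
        centroids += Centroid(x.clone(), 1.0)       // too far: new cluster
    }
  }
  centroids.toSeq
}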

Streaming k-means uses a variant of dp-means to get a sketch of the data.
This sketch has many more centroids than the final desired number of
clusters k.  As such, we don't have to worry much about the quality of the
initialization of this first dp-means pass.  The sketch is then clustered
using something like k-means++ followed by ball k-means iteration.
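
To make the second stage concrete, one ball k-means style update over the
weighted sketch could look roughly like this (again only a sketch; it reuses
the Centroid class from above, and the cutoff parameter corresponds to the
0.5x / 1x distance-to-nearest point in the earlier message quoted below):

// Each sketch point is assigned to its nearest center, but a center is
// recomputed only from the points inside a ball whose radius is cutoff times
// the distance to the nearest other center.
def ballKMeansStep(sketch: Seq[Centroid], centers: IndexedSeq[Array[Double]],
                   cutoff: Double): IndexedSeq[Array[Double]] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  centers.zipWithIndex.map { case (c, i) =>
    val radius = cutoff * centers.indices.filter(_ != i)
      .map(j => dist(c, centers(j))).min
    val inBall = sketch.filter { p =>
      centers.indices.minBy(j => dist(p.mean, centers(j))) == i &&
        dist(p.mean, c) <= radius
    }
    if (inBall.isEmpty) c                           // empty ball: keep old center
    else {
      val w = inBall.map(_.weight).sum              // weighted mean of ball members
      c.indices.map(d => inBall.map(p => p.mean(d) * p.weight).sum / w).toArray
    }
  }
}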

As I see it, having a good streaming k-means building block should make any
clustering algorithm better as long as the sketch is much smaller than the
original data.

The second phase of clustering after streaming k-means can benefit from any
technique that makes clustering better.  The caveat is just that streaming
k-means is likely to produce a small enough sketch that parallel clustering
of the sketch may not help very much.  Exactly where that trade-off applies
depends a lot on the parallel framework available.  Spark and h2o have much
lower task initiation costs, so they will definitely be able to get a
parallel speedup on much smaller datasets.
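
One place where parallelism clearly does pay off is the sketch-building pass
itself, since (as noted in the earlier message quoted below) per-partition
sketches can be merged by set union and re-collapsed if the union gets too
big.  A rough sketch of that merge, reusing dpMeansPass from above (the
maxSize parameter and the lambda-doubling heuristic are my own assumptions):

// Union per-partition sketches; if the union is too large, collapse it by
// re-running the dp-means pass over the centroid means with a coarser
// threshold.  (For brevity the weights are dropped in the collapse; a real
// merge would fold them into the weighted means.)
def mergeSketches(sketches: Seq[Seq[Centroid]], maxSize: Int,
                  lambda: Double): Seq[Centroid] = {
  val union = sketches.flatten
  if (union.size <= maxSize) union
  else dpMeansPass(union.map(_.mean).iterator, lambda * 2)
}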

On Fri, Apr 4, 2014 at 11:29 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> is there any opinion on whether either of the pipe-pipe (||) or dp-means
> initializations has bona-fide advantages over the other?
>
>
> On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > > On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > >
> > > > - Have you considered sketch based algorithms?
> > > >
> > >
> > > can you give me a reference? At this point I am just contemplating a
> > > more or less shameless port of what they've done in mllib.
> > >
> >
> > Here is the reference I used:
> >
> >
> > http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
> >
> > *A quick summary:*
> >
> > - a single pass over the data with a sort of dp-means [1] algorithm using
> > a very fast approximate centroid search can give you k log N centroids.
> > This is the sketch.
> >
> > - clustering these centroids as weighted values gives a provably probably
> > good clustering of the original data for well clusterable data.  For real
> > data, it seems to work exactly as advertised.
> >
> > - clustering the sketch using ball k-means with clever initialization is
> > important.  Good initialization is very expensive in terms of number of
> > points so using a sketch is a really good idea.
> >
> > - the proofs all depend on Euclidean distance.
> >
> > - the threshold can start very small (what small means is determined
> > empirically).  Each time we wind up with too many centroids, recursively
> > clustering the centroids and possibly increasing the threshold will keep
> > the number reasonably bounded.
> >
> > *Some notes from practical experience:*
> >
> > - moderate levels of parallelism are easy since sketches can be merged
> > using set union.  You may want to recursively cluster the centroids at
> > this point if you have too many.  This is a very nice application of
> > map-reduce.
> >
> > - high degrees of parallelism require multiple levels of merge/collapse
> > since otherwise you wind up with a sketch nearly as large as the original
> > data.  If you have m parallel clusterings then m k log(N/m) can be large.
> > Take a billion points and m = 1000 and k = 10,000.  The size of the
> > parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m = 100,
> > k = 1000, we have mk log(N/m) = 210 x 10^3 which is quite manageable.
> >
> > - ball k-means on highly clusterable data often uses a cutoff for
> > centroid computation of 0.5 x distance to nearest.  I find that with
> > real data 1 x distance to nearest, or even larger, is much better.  The
> > good effects are still mostly there, but you don't need wonderful data
> > to succeed.
> >
> > - the ball k-means algorithm that we have in Mahout is pretty high
> > quality, but could use a bit of triangle-inequality (Elkan [2]) speedup.
> >
> > *References*
> >
> > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> > [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
> >
> >
> > >
> > > >
> > > > - It can be important to use optimizations in the search for nearest
> > > > centroid.  Consider triangle optimizations.
> > > >
> > > > - Do you mean "parallel" when you type || or is there another meaning
> > > > there?
> > > >
> > >
> > > No, I mean the method called "kmeans||". It's an unfortunate name since
> > > I really don't know how to make Google search for it.
> > >
> > > http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
> > >
> > > >
> > > > - When you say ++ initialization, many people get this wrong and
> > > > assume that you mean pick the furthest point.  Getting really good
> > > > initialization is fairly difficult and typically requires more time
> > > > than the actual clustering.  This is one of the key benefits of
> > > > sketch based methods.
> > > >
> > > > - Most algorithms require multiple restarts.  At higher dimension the
> > > > number of restarts required becomes very large.  An ideal
> > > > implementation does parallel sketch extraction followed by parallel
> > > > ball k-means for restarts.
> > > >
> > > >
> > > >
> > > > On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> > > > wrote:
> > > >
> > > > > Considering porting implementation [1] and paper for KMeans || for
> > > > > Bindings.
> > > > >
> > > > > This seems like another method to map fairly nicely.
> > > > >
> > > > > The problem I am contemplating is ||-initialization, and in
> > > > > particular, centroid storage. That particular implementation assumes
> > > > > centroids could be kept in memory in front.
> > > > >
> > > > > (1) Question is, is it a dangerous idea? It doesn't seem like it
> > > > > particularly is, since it is unlikely people would want k > 1e+6.
> > > > > Another thing, centers seem to be passed in via a closure attribute
> > > > > (i.e. a java-serialized array-backed matrix). However, with Bindings
> > > > > it is quite possible to keep centers at the back as a matrix.
> > > > >
> > > > > (2) Obviously, Lloyd iterations are not terribly accurate. The ||
> > > > > and ++ versions mostly speed things up. Is there any
> > > > > better-than-Lloyd accuracy preference?
> > > > >
> > > > >
> > > > > [1]
> > > > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
