is there any opinion on whether one of the pipe-pipe (||) or dp-means
initializations has bona-fide advantages over the other?


On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning <[email protected]> wrote:

> On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <[email protected]>
> > wrote:
> >
> > > - Have you considered sketch based algorithms?
> > >
> >
> > Can you give me a reference? At this point I am just contemplating a more
> > or less shameless port of what they've done in MLlib.
> >
>
> Here is the reference I used:
>
>
> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
>
> *A quick summary:*
>
> - a single pass over the data with a sort of dp-means [1] algorithm using a
> very fast approximate centroid search can give you k log N centroids.  This
> is the sketch.
>
> - clustering these centroids as weighted values gives a provably probably
> good clustering of the original data for well clusterable data.  For real
> data, it seems to work exactly as advertised.
>
> - clustering the sketch using ball k-means with clever initialization is
> important.  Good initialization is very expensive in terms of number of
> points so using a sketch is a really good idea.
>
> - the proofs all depend on Euclidean distance.
>
> - the threshold can start very small (what small means is determined
> empirically).  Each time we wind up with too many centroids, recursively
> clustering the centroids and possibly increasing the threshold will keep
> the number reasonably bounded.
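The single-pass sketch with threshold growth described in the bullets above can be illustrated roughly as follows. Everything here (the centroid budget constant, the doubling factor, the exact merge rule, and the exhaustive nearest search standing in for the fast approximate one) is an assumption for illustration, not the paper's or Mahout's actual code:

```python
import math

def dist2(a, b):
    """Squared Euclidean distance (the proofs above assume Euclidean)."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def insert(cents, c, w, beta):
    """dp-means rule: far points become new centroids, near ones merge in."""
    d2, j = min(((dist2(c, o), j) for j, (o, _) in enumerate(cents)),
                default=(float("inf"), -1))
    if d2 > beta:
        cents.append((list(c), w))              # too far: new centroid
    else:
        o, ow = cents[j]                        # near: weighted average
        cents[j] = ([(oi * ow + ci * w) / (ow + w)
                     for oi, ci in zip(o, c)], ow + w)
    return cents

def streaming_sketch(points, k, n, beta=1e-6):
    """One pass over the data, keeping roughly O(k log N) weighted centroids."""
    budget = max(8, int(4 * k * math.log2(max(n, 2))))  # assumed constant 4
    cents = []
    for x in points:
        cents = insert(cents, x, 1.0, beta)
        if len(cents) > budget:
            beta *= 2.0                         # grow the threshold...
            merged, cents = cents, []
            for c, w in merged:                 # ...and re-sketch the centroids
                cents = insert(cents, c, w, beta)
    return cents
```

The threshold starts very small, exactly as described, and only grows when the centroid count overruns the budget.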
>
> *Some notes from practical experience:*
>
> - moderate levels of parallelism are easy since sketches can be merged
> using set union.  You may want to recursively cluster the centroids at this
> point if you have too many.  This is a very nice application of map-reduce.
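A hedged sketch of the merge step just described: plain set union of the weighted centroids, re-collapsed only when the union exceeds the centroid budget. The `collapse` parameter stands in for whatever recursive re-clustering step an implementation uses; names and the budget are illustrative:

```python
def merge_sketches(sketches, budget, collapse):
    """Merge per-partition sketches by set union of their weighted
    (centroid, weight) pairs; recursively re-cluster only if the
    union exceeds the centroid budget."""
    union = [wc for sk in sketches for wc in sk]
    return collapse(union) if len(union) > budget else union
```

Because union is associative and commutative, this fits a map-reduce combiner/reducer pattern directly.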
>
> - high degrees of parallelism require multiple levels of merge/collapse
> since otherwise you wind up with a sketch nearly as large as the original
> data.  If you have m parallel clusterings then m k log (N/m) can be large.
> Take a billion points, m = 1000, and k = 10,000.  The size of the
> parallel sketches is mk log(N/m) = 200 x 10^6 = 0.2 N.  But with m = 100 and
> k = 1000, we have mk log(N/m) ≈ 2.3 x 10^6, which is quite manageable.
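The size arithmetic can be reproduced directly. Taking the log base 2 (the base that matches the 200 x 10^6 figure for the first configuration), the second configuration works out to roughly 2.3 x 10^6:

```python
import math

def sketch_size(n, m, k):
    """Total centroids across m parallel sketches of n/m points each."""
    return m * k * math.log2(n / m)

n = 1_000_000_000
ratio = sketch_size(n, m=1000, k=10_000) / n   # ~0.2: nearly as big as the data
small = sketch_size(n, m=100, k=1000)          # ~2.3e6: quite manageable
```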
>
> - ball k-means on highly clusterable data often uses a cutoff for centroid
> computation of 0.5 x distance to nearest.  I find that with real data,
> 1 x distance to nearest or even larger is much better.  The good effects
> are still mostly there, but you don't need wonderful data to succeed.
>
> - the ball k-means algorithm that we have in Mahout is pretty high quality,
> but could use a bit of triangle-inequality (Elkan [2]) speedups.
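For context on the Elkan [2] point: the core trick is a triangle-inequality bound that lets the nearest-centroid search skip whole distance computations. If d(c1, c2) >= 2 d(x, c1), then d(x, c2) >= d(c1, c2) - d(x, c1) >= d(x, c1), so c2 cannot beat c1. A minimal statement of the check (illustrative helper; `math.dist` needs Python 3.8+):

```python
import math

def can_skip(x, c1, c2):
    """True when the triangle inequality proves c2 is no closer to x
    than the current nearest centroid c1, so d(x, c2) need not be
    computed at all."""
    return math.dist(c1, c2) >= 2 * math.dist(x, c1)
```

Since centroid-to-centroid distances can be computed once per iteration, this prunes most of the inner-loop work when clusters are well separated.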
>
> *References*
>
> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
>
>
> >
> > >
> > > - It can be important to use optimizations in the search for nearest
> > > centroid.  Consider triangle optimizations.
> > >
> > > - Do you mean "parallel" when you type || or is there another meaning
> > > there?
> > >
> >
> > No, I mean the method called "kmeans||". It's an unfortunate name since I
> > really don't know how to make Google search for it.
> >
> > http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
> >
> > >
> > > - When you say ++ initialization, many people get this wrong and assume
> > > that you mean pick the furthest point.  Getting really good initialization
> > > is fairly difficult and typically requires more time than the actual
> > > clustering.  This is one of the key benefits of sketch based methods.
> > >
> > > - Most algorithms require multiple restarts.  At higher dimension the
> > > number of restarts required becomes very large.  An ideal implementation
> > > does parallel sketch extraction followed by parallel ball k-means for
> > > restarts.
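To make the ++ point concrete: k-means++ samples each next seed with probability proportional to D(x)^2, the squared distance to the nearest seed chosen so far, rather than just taking the furthest point. A minimal illustrative version (not Mahout's or MLlib's implementation) also shows where the cost goes: every new seed needs a full pass of distance computations, which is exactly why running it on a small sketch instead of all N points is attractive:

```python
import random

def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeanspp_seeds(points, k, rng=random):
    """D^2-weighted (k-means++) seeding. Assumes k is at most the
    number of distinct points."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # The expensive part: one distance pass over all points per seed.
        d2 = [min(dist2(x, s) for s in seeds) for x in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if w > 0 and acc >= r:   # sample proportionally to D(x)^2
                seeds.append(x)
                break
    return seeds
```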
> > >
> > >
> > >
> > > On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <[email protected]>
> > > wrote:
> > >
> > > > Considering porting the implementation [1] and paper for KMeans|| to
> > > > Bindings.
> > > >
> > > > This seems like another method to map fairly nicely.
> > > >
> > > > The problem I am contemplating is ||-initialization, and in particular,
> > > > centroid storage. That particular implementation assumes centroids could
> > > > be kept in memory in front.
> > > >
> > > > (1) The question is: is it a dangerous idea? It doesn't seem like it
> > > > particularly is, since it is unlikely people would want k > 1e+6. Another
> > > > thing: centers seem to be passed in via a closure attribute (i.e. a
> > > > java-serialized array-backed matrix). However, with Bindings it is quite
> > > > possible to keep centers at the back as a matrix.
> > > >
> > > > (2) Obviously, Lloyd iterations are not terribly accurate. The || and ++
> > > > versions mostly speed things up. Is there any better-than-Lloyd accuracy
> > > > preference?
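On the accuracy question in (2): || and ++ only improve seeding; the accuracy-side refinement discussed earlier in the thread is ball k-means, where each centroid is recomputed from only the points inside a trimmed ball around it. One plausible reading of that rule, as an illustrative sketch (the `trim` factor is the 0.5x-vs-1x cutoff mentioned above; this is not Mahout's code):

```python
import math

def lloyd_ball_step(points, centroids, trim=1.0):
    """One Lloyd iteration with ball trimming: recompute each centroid
    from only its assigned points within trim x (distance to the nearest
    other centroid). Assumes at least two centroids."""
    # assignment step: nearest centroid for each point
    assign = [min(range(len(centroids)),
                  key=lambda j: math.dist(x, centroids[j])) for x in points]
    new = []
    for j, c in enumerate(centroids):
        # ball radius around centroid j
        r = trim * min(math.dist(c, o)
                       for i, o in enumerate(centroids) if i != j)
        members = [x for x, a in zip(points, assign)
                   if a == j and math.dist(x, c) <= r]
        if members:
            new.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new.append(c)   # empty ball: keep the old centroid
    return new
```

With trim large the step degenerates to plain Lloyd; smaller trim ignores boundary points, which is where the robustness of the ball variant comes from.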
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
> > > >
> > >
> >
>
