On Fri, Apr 4, 2014 at 8:44 AM, Dmitriy Lyubimov <[email protected]> wrote:
> On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning <[email protected]> wrote:
>
>> Pipe-pipe is an initialization algorithm which (roughly) parallelizes
>> something like the ++ algorithm and requires logarithmically many
>> parallel passes over the data, as opposed to the k passes required
>> for ++.
>
> I'll re-read it again. The way I understood the paper (and the code),
> pipe-pipe actually uniformly subsamples the data in each pass, bringing
> in a limited number of points per pass, which is O(k), not O(n). That is
> not the same as going over every point and computing a distance-based
> probability for each of them, if that is what you imply by the
> comparison with ++. DP-means would have to do subsampling as well.

Retracting that. They do full passes over the data, with the full
per-point costs recomputed each time.

>> DP-means itself is a variant of k-means where the value of k is not
>> fixed. Instead, whenever a data point is further than a threshold
>> \lambda from the nearest centroid, it is used to form a new centroid.
>> Different choices of \lambda result in different numbers of clusters.
>> I don't know how well DP-means by itself would work as an
>> initialization pass.
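Just to pin down the rule being discussed, here is how I read the
DP-means step, as a toy in-memory Scala sketch (the object name and
lambda handling are made up for illustration, and the linear scan stands
in for the fast approximate nearest-centroid search a real sketcher
would use):

import scala.collection.mutable.ArrayBuffer

// Toy dp-means-style pass: a point further than lambda from every
// existing centroid starts a new centroid, otherwise it is folded into
// the nearest one as added weight. Illustrative only, not the Mahout or
// MLlib code.
object DpMeansSketch {

  private def dSq(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  def sketch(points: Iterator[Array[Double]], lambda: Double)
      : (ArrayBuffer[Array[Double]], ArrayBuffer[Double]) = {
    val centroids = ArrayBuffer[Array[Double]]()
    val weights = ArrayBuffer[Double]()
    for (x <- points) {
      if (centroids.isEmpty) { centroids += x.clone; weights += 1.0 }
      else {
        val (d2, j) = centroids.map(dSq(_, x)).zipWithIndex.minBy(_._1)
        if (d2 > lambda * lambda) {
          // too far from everything: start a new centroid
          centroids += x.clone; weights += 1.0
        } else {
          // fold the point into the nearest centroid (weighted mean update)
          val w = weights(j); val c = centroids(j)
          var i = 0
          while (i < c.length) { c(i) = (c(i) * w + x(i)) / (w + 1.0); i += 1 }
          weights(j) = w + 1.0
        }
      }
    }
    (centroids, weights)
  }
}

So the only per-point decision is "nearest centroid vs. new centroid",
and lambda, rather than a fixed k, controls how many centroids the
sketch ends up with.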
>> Streaming k-means uses a variant of DP-means to get a sketch of the
>> data. This sketch has many more centroids than the final desired
>> number of clusters k. As such, we don't have to worry much about the
>> quality of the initialization of this first DP-means pass. The sketch
>> is then clustered using something like k-means++ and ball k-means
>> iteration.
>>
>> As I see it, having a good streaming k-means block should make any
>> clustering algorithm better, as long as the sketch is much smaller
>> than the original data.
>>
>> The second phase of clustering after streaming k-means can benefit
>> from any technique that makes clustering better. The caveat is just
>> that streaming k-means is likely to produce a small enough sketch that
>> parallel clustering may not help very much. Exactly where that
>> trade-off applies depends a lot on the parallel framework available.
>> Spark and h2o have much lower task initiation costs, so they will
>> definitely be able to get a parallel speedup on much smaller datasets.
>>
>> On Fri, Apr 4, 2014 at 11:29 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> Is there any opinion on whether either of the pipe-pipe (||) or
>>> DP-means initializations has bona fide advantages over the other?
>>>
>>> On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning <[email protected]> wrote:
>>>
>>>> On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>>> On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> - Have you considered sketch based algorithms?
>>>>>
>>>>> Can you give me a reference? At this point I am just contemplating
>>>>> a more or less shameless port of what they've done in MLlib.
>>>>
>>>> Here is the reference I used:
>>>>
>>>> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
>>>>
>>>> *A quick summary:*
>>>>
>>>> - A single pass over the data with a sort of DP-means [1] algorithm,
>>>> using a very fast approximate centroid search, can give you k log N
>>>> centroids. This is the sketch.
>>>>
>>>> - Clustering these centroids as weighted values gives a clustering
>>>> of the original data that is provably good (with high probability)
>>>> for well-clusterable data. For real data, it seems to work exactly
>>>> as advertised.
>>>>
>>>> - Clustering the sketch using ball k-means with clever
>>>> initialization is important. Good initialization is very expensive
>>>> in terms of the number of points, so using a sketch is a really good
>>>> idea.
>>>>
>>>> - The proofs all depend on Euclidean distance.
>>>>
>>>> - The threshold can start very small (what "small" means is
>>>> determined empirically). Each time we wind up with too many
>>>> centroids, recursively clustering the centroids and possibly
>>>> increasing the threshold will keep the number reasonably bounded.
>>>>
>>>> *Some notes from practical experience:*
>>>>
>>>> - Moderate levels of parallelism are easy since sketches can be
>>>> merged using set union. You may want to recursively cluster the
>>>> centroids at this point if you have too many. This is a very nice
>>>> application of map-reduce.
>>>>
>>>> - High degrees of parallelism require multiple levels of
>>>> merge/collapse, since otherwise you wind up with a sketch nearly as
>>>> large as the original data. If you have m parallel clusterings then
>>>> m k log(N/m) can be large. Take a billion points, m = 1000 and
>>>> k = 10,000: the size of the parallel sketches is
>>>> m k log(N/m) = 200 x 10^6 = 0.2 N. But with m = 100, k = 1000, we
>>>> have m k log(N/m) = 210 x 10^3, which is quite manageable.
>>>>
>>>> - Ball k-means on highly clusterable data often uses a cutoff for
>>>> centroid computation of 0.5 x the distance to the nearest other
>>>> centroid. I find that with real data, 1 x that distance or even
>>>> larger is much better. The good effects are still mostly there, but
>>>> you don't need wonderful data to succeed.
>>>>
>>>> - The ball k-means algorithm that we have in Mahout is pretty high
>>>> quality, but could use a bit of triangle-inequality (Elkan [2])
>>>> speedups.
>>>>
>>>> *References*
>>>>
>>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
>>>> [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
>>>>
>>>>>> - It can be important to use optimizations in the search for the
>>>>>> nearest centroid. Consider triangle optimizations.
>>>>>>
>>>>>> - Do you mean "parallel" when you type || or is there another
>>>>>> meaning there?
>>>>>
>>>>> No, I mean the method called "kmeans||". It's an unfortunate name,
>>>>> since I really don't know how to make Google search for it.
>>>>>
>>>>> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
>>>>>
>>>>>> - When you say ++ initialization, many people get this wrong and
>>>>>> assume that you mean picking the furthest point. Getting really
>>>>>> good initialization is fairly difficult and typically requires
>>>>>> more time than the actual clustering. This is one of the key
>>>>>> benefits of sketch-based methods.
>>>>>>
>>>>>> - Most algorithms require multiple restarts. At higher dimension
>>>>>> the number of restarts required becomes very large. An ideal
>>>>>> implementation does parallel sketch extraction followed by
>>>>>> parallel ball k-means for restarts.
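A quick aside on the triangle / Elkan remarks above: even the crudest
version of that pruning is easy to sketch. The Scala toy below (made-up
names, plain arrays, not the Mahout code) precomputes the pairwise
centroid distances once per iteration and then skips any centroid that
the triangle inequality already rules out during the nearest-centroid
scan.

// Toy triangle-inequality pruning for the nearest-centroid search.
// If d(best, c) >= 2 * d(x, best), then c cannot be closer to x than
// 'best' is, so the distance to c never needs to be computed.
object TrianglePrune {

  private def dist(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    math.sqrt(s)
  }

  // Precompute pairwise centroid distances once per Lloyd-style iteration.
  def pairwise(centroids: Array[Array[Double]]): Array[Array[Double]] =
    Array.tabulate(centroids.length, centroids.length)(
      (i, j) => dist(centroids(i), centroids(j)))

  def nearest(x: Array[Double], centroids: Array[Array[Double]],
              centroidDist: Array[Array[Double]]): Int = {
    var best = 0
    var bestD = dist(x, centroids(0))
    var j = 1
    while (j < centroids.length) {
      // prune: centroid j cannot beat 'best' if it is far from 'best'
      if (centroidDist(best)(j) < 2 * bestD) {
        val d = dist(x, centroids(j))
        if (d < bestD) { bestD = d; best = j }
      }
      j += 1
    }
    best
  }
}

Elkan's algorithm proper additionally carries per-point upper and lower
bounds across iterations, but even this cheap check removes most of the
distance computations once the centroids are well separated.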
>>>>>> On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>
>>>>>>> Considering porting implementation [1] and the paper for
>>>>>>> KMeans || for Bindings.
>>>>>>>
>>>>>>> This seems like another method that maps fairly nicely.
>>>>>>>
>>>>>>> The problem I am contemplating is the ||-initialization, and in
>>>>>>> particular, centroid storage. That particular implementation
>>>>>>> assumes the centroids can be kept in memory on the front end.
>>>>>>>
>>>>>>> (1) The question is whether that is a dangerous idea. It doesn't
>>>>>>> seem like it particularly is, since it is unlikely people would
>>>>>>> want k > 1e+6. Another thing: the centers seem to be passed in
>>>>>>> via a closure attribute (i.e. a java-serialized, array-backed
>>>>>>> matrix). However, with Bindings it is quite possible to keep the
>>>>>>> centers at the back as a matrix.
>>>>>>>
>>>>>>> (2) Obviously, Lloyd iterations are not terribly accurate; the
>>>>>>> || and ++ versions mostly speed things up. Is there any
>>>>>>> preference for something with better-than-Lloyd accuracy?
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
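To make the original question a bit more concrete: the ||-initialization
from the kmeans|| paper linked above reduces, in single-machine toy form,
to a few oversampling rounds like the sketch below. The object name, the
number of rounds and the oversampling factor are all made up for
illustration, and this is not the MLlib code; the point is just that the
candidate set stays around 1 + rounds * 2k points, which is why keeping
the centers in memory on the front end does not look dangerous for
reasonable k.

import scala.util.Random

// Toy, single-machine rendition of the k-means|| oversampling rounds.
// A real implementation runs each round as a distributed pass over the
// data and only ships the sampled candidates back to the front end.
object KMeansParallelInit {

  private def dSq(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  def init(points: Array[Array[Double]], k: Int, rounds: Int = 5,
           rng: Random = new Random(42)): Seq[Array[Double]] = {
    val l = 2.0 * k                                   // oversampling factor
    var centers: Seq[Array[Double]] = Seq(points(rng.nextInt(points.length)))
    for (_ <- 1 to rounds) {
      // one "pass" over the data: cost of every point w.r.t. current centers
      val costs = points.map(x => centers.map(dSq(x, _)).min)
      val phi = costs.sum
      // sample each point with probability min(1, l * cost / phi)
      val picked = points.indices.filter(i => rng.nextDouble() < l * costs(i) / phi)
      centers = centers ++ picked.map(points(_))
    }
    // 'centers' now holds roughly 1 + rounds * l candidates; the full
    // algorithm would weight them by the points they attract and then
    // recluster them down to k (e.g. with k-means++).
    centers
  }
}

In the distributed version, each round's cost and sampling step is the
map side of one pass over the data, and only the sampled candidates
travel back to the front end, where they are later weighted and
reclustered down to k.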
