On Fri, Apr 4, 2014 at 8:44 AM, Dmitriy Lyubimov <[email protected]> wrote:
> On Fri, Apr 4, 2014 at 4:26 AM, Ted Dunning <[email protected]> wrote:
>
>> Pipe-pipe is an initialization algorithm which (roughly) parallelizes
>> something like the ++ algorithm and requires logarithmically many
>> parallel passes over the data, as opposed to the k passes required
>> for ++.
>
> I'll re-read it again. The way I understood the paper (and the code),
> pipe-pipe actually uniformly subsamples the data in each pass, bringing
> in a limited number of points per pass, which is O(k), not O(n). That is
> not the same as going over every point and computing a distance-based
> probability for each of them, if that is what you imply by the
> comparison with ++. DP-means would have to do subsampling as well.

Retracting that. They do full passes over the data, with the full
per-point costs recomputed each time.

>> DP-means itself is a variant of k-means where the value of k is not
>> fixed. Instead, whenever a data point is further than a threshold
>> \lambda from the nearest centroid, it is used to form a new centroid.
>> Different choices of \lambda result in different numbers of clusters.
>> I don't know how well DP-means by itself would work as an
>> initialization pass.
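Just to pin down the rule being discussed, here is how I read the
DP-means step, as a toy in-memory Scala sketch (the object name and
lambda handling are made up for illustration, and the linear scan stands
in for the fast approximate nearest-centroid search a real sketcher
would use):

import scala.collection.mutable.ArrayBuffer

// Toy dp-means-style pass: a point further than lambda from every
// existing centroid starts a new centroid, otherwise it is folded into
// the nearest one as added weight. Illustrative only, not the Mahout or
// MLlib code.
object DpMeansSketch {

  private def dSq(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  def sketch(points: Iterator[Array[Double]], lambda: Double)
      : (ArrayBuffer[Array[Double]], ArrayBuffer[Double]) = {
    val centroids = ArrayBuffer[Array[Double]]()
    val weights = ArrayBuffer[Double]()
    for (x <- points) {
      if (centroids.isEmpty) { centroids += x.clone; weights += 1.0 }
      else {
        val (d2, j) = centroids.map(dSq(_, x)).zipWithIndex.minBy(_._1)
        if (d2 > lambda * lambda) {
          // too far from everything: start a new centroid
          centroids += x.clone; weights += 1.0
        } else {
          // fold the point into the nearest centroid (weighted mean update)
          val w = weights(j); val c = centroids(j)
          var i = 0
          while (i < c.length) { c(i) = (c(i) * w + x(i)) / (w + 1.0); i += 1 }
          weights(j) = w + 1.0
        }
      }
    }
    (centroids, weights)
  }
}

So the only per-point decision is "nearest centroid vs. new centroid",
and lambda, rather than a fixed k, controls how many centroids the
sketch ends up with.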
>> Streaming k-means uses a variant of DP-means to get a sketch of the
>> data. This sketch has many more centroids than the final desired
>> number of clusters k. As such, we don't have to worry much about the
>> quality of the initialization of this first DP-means pass. The sketch
>> is then clustered using something like k-means++ and ball k-means
>> iteration.
>>
>> As I see it, having a good streaming k-means block should make any
>> clustering algorithm better, as long as the sketch is much smaller
>> than the original data.
>>
>> The second phase of clustering after streaming k-means can benefit
>> from any technique that makes clustering better. The caveat is just
>> that streaming k-means is likely to produce a small enough sketch that
>> parallel clustering may not help very much. Exactly where that
>> trade-off applies depends a lot on the parallel framework available.
>> Spark and h2o have much lower task initiation costs, so they will
>> definitely be able to get a parallel speedup on much smaller datasets.
>>
>> On Fri, Apr 4, 2014 at 11:29 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> Is there any opinion on whether either of the pipe-pipe (||) or
>>> DP-means initializations has bona fide advantages over the other?
>>>
>>> On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning <[email protected]> wrote:
>>>
>>>> On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>>> On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> - Have you considered sketch based algorithms?
>>>>>
>>>>> Can you give me a reference? At this point I am just contemplating
>>>>> a more or less shameless port of what they've done in MLlib.
>>>>
>>>> Here is the reference I used:
>>>>
>>>> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
>>>>
>>>> *A quick summary:*
>>>>
>>>> - A single pass over the data with a sort of DP-means [1] algorithm,
>>>> using a very fast approximate centroid search, can give you k log N
>>>> centroids. This is the sketch.
>>>>
>>>> - Clustering these centroids as weighted values gives a clustering
>>>> of the original data that is provably good (with high probability)
>>>> for well-clusterable data. For real data, it seems to work exactly
>>>> as advertised.
>>>>
>>>> - Clustering the sketch using ball k-means with clever
>>>> initialization is important. Good initialization is very expensive
>>>> in terms of the number of points, so using a sketch is a really good
>>>> idea.
>>>>
>>>> - The proofs all depend on Euclidean distance.
>>>>
>>>> - The threshold can start very small (what "small" means is
>>>> determined empirically). Each time we wind up with too many
>>>> centroids, recursively clustering the centroids and possibly
>>>> increasing the threshold will keep the number reasonably bounded.
>>>>
>>>> *Some notes from practical experience:*
>>>>
>>>> - Moderate levels of parallelism are easy since sketches can be
>>>> merged using set union. You may want to recursively cluster the
>>>> centroids at this point if you have too many. This is a very nice
>>>> application of map-reduce.
>>>>
>>>> - High degrees of parallelism require multiple levels of
>>>> merge/collapse, since otherwise you wind up with a sketch nearly as
>>>> large as the original data. If you have m parallel clusterings then
>>>> m k log(N/m) can be large. Take a billion points, m = 1000 and
>>>> k = 10,000: the size of the parallel sketches is
>>>> m k log(N/m) = 200 x 10^6 = 0.2 N. But with m = 100, k = 1000, we
>>>> have m k log(N/m) = 210 x 10^3, which is quite manageable.
>>>>
>>>> - Ball k-means on highly clusterable data often uses a cutoff for
>>>> centroid computation of 0.5 x the distance to the nearest other
>>>> centroid. I find that with real data, 1 x that distance or even
>>>> larger is much better. The good effects are still mostly there, but
>>>> you don't need wonderful data to succeed.
>>>>
>>>> - The ball k-means algorithm that we have in Mahout is pretty high
>>>> quality, but could use a bit of triangle-inequality (Elkan [2])
>>>> speedups.
>>>>
>>>> *References*
>>>>
>>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
>>>> [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
>>>>
>>>>>> - It can be important to use optimizations in the search for the
>>>>>> nearest centroid. Consider triangle optimizations.
>>>>>>
>>>>>> - Do you mean "parallel" when you type || or is there another
>>>>>> meaning there?
>>>>>
>>>>> No, I mean the method called "kmeans||". It's an unfortunate name,
>>>>> since I really don't know how to make Google search for it.
>>>>>
>>>>> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
>>>>>
>>>>>> - When you say ++ initialization, many people get this wrong and
>>>>>> assume that you mean picking the furthest point. Getting really
>>>>>> good initialization is fairly difficult and typically requires
>>>>>> more time than the actual clustering. This is one of the key
>>>>>> benefits of sketch-based methods.
>>>>>>
>>>>>> - Most algorithms require multiple restarts. At higher dimension
>>>>>> the number of restarts required becomes very large. An ideal
>>>>>> implementation does parallel sketch extraction followed by
>>>>>> parallel ball k-means for restarts.
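A quick aside on the triangle / Elkan remarks above: even the crudest
version of that pruning is easy to sketch. The Scala toy below (made-up
names, plain arrays, not the Mahout code) precomputes the pairwise
centroid distances once per iteration and then skips any centroid that
the triangle inequality already rules out during the nearest-centroid
scan.

// Toy triangle-inequality pruning for the nearest-centroid search.
// If d(best, c) >= 2 * d(x, best), then c cannot be closer to x than
// 'best' is, so the distance to c never needs to be computed.
object TrianglePrune {

  private def dist(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    math.sqrt(s)
  }

  // Precompute pairwise centroid distances once per Lloyd-style iteration.
  def pairwise(centroids: Array[Array[Double]]): Array[Array[Double]] =
    Array.tabulate(centroids.length, centroids.length)(
      (i, j) => dist(centroids(i), centroids(j)))

  def nearest(x: Array[Double], centroids: Array[Array[Double]],
              centroidDist: Array[Array[Double]]): Int = {
    var best = 0
    var bestD = dist(x, centroids(0))
    var j = 1
    while (j < centroids.length) {
      // prune: centroid j cannot beat 'best' if it is far from 'best'
      if (centroidDist(best)(j) < 2 * bestD) {
        val d = dist(x, centroids(j))
        if (d < bestD) { bestD = d; best = j }
      }
      j += 1
    }
    best
  }
}

Elkan's algorithm proper additionally carries per-point upper and lower
bounds across iterations, but even this cheap check removes most of the
distance computations once the centroids are well separated.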
>>>>>> On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>
>>>>>>> Considering porting implementation [1] and the paper for
>>>>>>> KMeans || for Bindings.
>>>>>>>
>>>>>>> This seems like another method that maps fairly nicely.
>>>>>>>
>>>>>>> The problem I am contemplating is the ||-initialization, and in
>>>>>>> particular, centroid storage. That particular implementation
>>>>>>> assumes the centroids can be kept in memory on the front end.
>>>>>>>
>>>>>>> (1) The question is whether that is a dangerous idea. It doesn't
>>>>>>> seem like it particularly is, since it is unlikely people would
>>>>>>> want k > 1e+6. Another thing: the centers seem to be passed in
>>>>>>> via a closure attribute (i.e. a java-serialized, array-backed
>>>>>>> matrix). However, with Bindings it is quite possible to keep the
>>>>>>> centers at the back as a matrix.
>>>>>>>
>>>>>>> (2) Obviously, Lloyd iterations are not terribly accurate; the
>>>>>>> || and ++ versions mostly speed things up. Is there any
>>>>>>> preference for something with better-than-Lloyd accuracy?
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
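To make the original question a bit more concrete: the ||-initialization
from the kmeans|| paper linked above reduces, in single-machine toy form,
to a few oversampling rounds like the sketch below. The object name, the
number of rounds and the oversampling factor are all made up for
illustration, and this is not the MLlib code; the point is just that the
candidate set stays around 1 + rounds * 2k points, which is why keeping
the centers in memory on the front end does not look dangerous for
reasonable k.

import scala.util.Random

// Toy, single-machine rendition of the k-means|| oversampling rounds.
// A real implementation runs each round as a distributed pass over the
// data and only ships the sampled candidates back to the front end.
object KMeansParallelInit {

  private def dSq(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  def init(points: Array[Array[Double]], k: Int, rounds: Int = 5,
           rng: Random = new Random(42)): Seq[Array[Double]] = {
    val l = 2.0 * k                                   // oversampling factor
    var centers: Seq[Array[Double]] = Seq(points(rng.nextInt(points.length)))
    for (_ <- 1 to rounds) {
      // one "pass" over the data: cost of every point w.r.t. current centers
      val costs = points.map(x => centers.map(dSq(x, _)).min)
      val phi = costs.sum
      // sample each point with probability min(1, l * cost / phi)
      val picked = points.indices.filter(i => rng.nextDouble() < l * costs(i) / phi)
      centers = centers ++ picked.map(points(_))
    }
    // 'centers' now holds roughly 1 + rounds * l candidates; the full
    // algorithm would weight them by the points they attract and then
    // recluster them down to k (e.g. with k-means++).
    centers
  }
}

In the distributed version, each round's cost and sampling step is the
map side of one pass over the data, and only the sampled candidates
travel back to the front end, where they are later weighted and
reclustered down to k.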
