Is there any opinion on whether either of these two initializations, pipe-pipe (||) or dp-means, has bona fide advantages over the other?
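
To make the comparison concrete, here is roughly what the single-pass dp-means-style
sketch Ted describes below would look like. This is purely an illustrative Scala
sketch, not the Mahout or MLlib code: it assumes Euclidean distance, uses a
brute-force nearest-centroid search where the paper uses a fast approximate
search, and all the names are made up.

    import scala.collection.mutable.ArrayBuffer

    object StreamingSketch {

      final case class Centroid(center: Array[Double], var weight: Double)

      private def dist(a: Array[Double], b: Array[Double]): Double = {
        var s = 0.0
        var i = 0
        while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
        math.sqrt(s)
      }

      // One pass over weighted points; emits roughly k log N weighted
      // centroids. Brute force stands in for the fast approximate search.
      def sketch(points: Iterator[(Array[Double], Double)],
                 initialThreshold: Double,
                 maxCentroids: Int): ArrayBuffer[Centroid] = {
        var threshold = initialThreshold
        var cs = ArrayBuffer[Centroid]()
        for ((p, w) <- points) {
          val nearest =
            if (cs.isEmpty) None else Some(cs.minBy(c => dist(c.center, p)))
          nearest match {
            case Some(c) if dist(c.center, p) <= threshold =>
              // fold the point into its nearest centroid (weighted mean)
              val wNew = c.weight + w
              for (i <- c.center.indices)
                c.center(i) += (p(i) - c.center(i)) * w / wNew
              c.weight = wNew
            case _ =>
              // too far from every existing centroid: start a new one
              cs += Centroid(p.clone, w)
          }
          if (cs.size > maxCentroids) {
            // too many centroids: raise the threshold and recursively
            // sketch the (weighted) centroids themselves; the growth
            // factor 1.5 is an arbitrary illustrative choice
            threshold *= 1.5
            cs = sketch(cs.iterator.map(c => (c.center, c.weight)),
                        threshold, maxCentroids)
          }
        }
        cs
      }
    }

Because the output is weighted, sketches from m mappers can be merged by simple
union and collapsed with the same routine, which is the map-reduce pattern Ted
mentions below.
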
On Fri, Apr 4, 2014 at 12:24 AM, Ted Dunning <[email protected]> wrote:

> On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <[email protected]> wrote:
> >
> > > - Have you considered sketch based algorithms?
> >
> > Can you give me a reference? At this point I am just contemplating a
> > more or less shameless port of what they've done in MLlib.
>
> Here is the reference I used:
>
> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets
>
> *A quick summary:*
>
> - A single pass over the data with a sort of dp-means [1] algorithm, using
> a very fast approximate centroid search, can give you k log N centroids.
> This is the sketch.
>
> - Clustering these centroids as weighted values gives a provably probably
> good clustering of the original data for well-clusterable data. For real
> data, it seems to work exactly as advertised.
>
> - Clustering the sketch using ball k-means with clever initialization is
> important. Good initialization is very expensive in terms of the number of
> points, so using a sketch is a really good idea.
>
> - The proofs all depend on Euclidean distance.
>
> - The threshold can start very small (what "small" means is determined
> empirically). Each time we wind up with too many centroids, recursively
> clustering the centroids and possibly increasing the threshold will keep
> the number reasonably bounded.
>
> *Some notes from practical experience:*
>
> - Moderate levels of parallelism are easy since sketches can be merged
> using set union. You may want to recursively cluster the centroids at
> this point if you have too many. This is a very nice application of
> map-reduce.
>
> - High degrees of parallelism require multiple levels of merge/collapse,
> since otherwise you wind up with a sketch nearly as large as the original
> data: with m parallel clusterings, m k log(N/m) can be large. Take a
> billion points, m = 1000 and k = 10,000. The size of the parallel sketches
> is m k log(N/m) = 200 x 10^6 = 0.2 N. But with m = 100 and k = 1000, we
> have m k log(N/m) = 2.3 x 10^6, about 0.002 N, which is quite manageable.
>
> - Ball k-means on highly clusterable data often uses a cutoff for centroid
> computation of 0.5 x the distance to the nearest other centroid. I find
> that with real data a cutoff of 1 x that distance, or even larger, is
> much better. The good effects are still mostly there, but you don't need
> wonderful data to succeed.
>
> - The ball k-means algorithm that we have in Mahout is pretty high
> quality, but could use a bit of triangle-inequality (Elkan [2]) speedup.
>
> *References*
>
> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> [2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf
>
> > > - It can be important to use optimizations in the search for the
> > > nearest centroid. Consider triangle optimizations.
> > >
> > > - Do you mean "parallel" when you type || or is there another meaning
> > > there?
> >
> > No, I mean the method called "kmeans||". It's an unfortunate name, since
> > I really don't know how to make Google search for it.
> >
> > http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
>
> > > - When you say ++ initialization, many people get this wrong and
> > > assume that you mean picking the furthest point. Getting really good
> > > initialization is fairly difficult and typically requires more time
> > > than the actual clustering. This is one of the key benefits of sketch
> > > based methods.
> > >
> > > - Most algorithms require multiple restarts. At higher dimensions the
> > > number of restarts required becomes very large. An ideal
> > > implementation does parallel sketch extraction followed by parallel
> > > ball k-means for the restarts.
> > >
> > > On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > >
> > > > Considering porting the implementation [1] and the paper for
> > > > KMeans|| to the Bindings.
> > > >
> > > > This seems like another method that maps fairly nicely.
> > > >
> > > > The problem I am contemplating is the ||-initialization, and in
> > > > particular, centroid storage. That particular implementation assumes
> > > > the centroids can be kept in memory at the front.
> > > >
> > > > (1) The question is, is this a dangerous idea? It doesn't seem like
> > > > it particularly is, since it is unlikely that people would want
> > > > k > 1e+6. Another thing: the centers seem to be passed in via a
> > > > closure attribute (i.e. a java-serialized array-backed matrix).
> > > > However, with the Bindings it is quite possible to keep the centers
> > > > at the back as a matrix.
> > > >
> > > > (2) Obviously, Lloyd iterations are not terribly accurate. The ||
> > > > and ++ versions mostly speed things up. Is there any preference for
> > > > better-than-Lloyd accuracy?
> > > >
> > > > [1]
> > > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
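
For reference, the ||-initialization from the kmeans|| paper looks roughly like
this. Again an illustrative sketch under stated assumptions, not the MLlib
implementation: data is an in-memory Seq standing in for an RDD, the
oversampling factor l = 2k and a fixed round count are conventional choices,
and the names are invented.

    import scala.util.Random

    object KMeansParallelInit {

      private val rng = new Random(42)

      private def dist2(a: Array[Double], b: Array[Double]): Double = {
        var s = 0.0
        var i = 0
        while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
        s
      }

      // squared distance from x to the closest current center
      private def cost(x: Array[Double], cs: Seq[Array[Double]]): Double =
        cs.map(c => dist2(x, c)).min

      // draw an index with probability proportional to its weight
      private def pickWeighted(w: IndexedSeq[Double]): Int = {
        val r = rng.nextDouble() * w.sum
        var acc = 0.0
        var i = 0
        while (i < w.length - 1 && acc + w(i) < r) { acc += w(i); i += 1 }
        i
      }

      def init(data: Seq[Array[Double]], k: Int,
               rounds: Int): Seq[Array[Double]] = {
        val l = 2.0 * k                        // oversampling factor
        var candidates = Vector(data(rng.nextInt(data.length)))
        for (_ <- 1 to rounds) {
          val phi = data.map(cost(_, candidates)).sum
          // keep each point independently with probability l * d^2 / phi;
          // in MLlib this step runs as a distributed map over the RDD
          candidates ++= data.filter { x =>
            rng.nextDouble() < l * cost(x, candidates) / phi
          }
        }
        // weight each candidate by how many points are closest to it ...
        val weights = Array.fill(candidates.length)(0.0)
        for (x <- data)
          weights(candidates.indices.minBy(i => dist2(x, candidates(i)))) += 1.0
        // ... then run weighted k-means++ over the small candidate set to
        // cut it down to exactly k centers
        var chosen = Vector(candidates(pickWeighted(weights.toIndexedSeq)))
        while (chosen.length < k) {
          val w = candidates.indices
            .map(i => weights(i) * cost(candidates(i), chosen))
          chosen :+= candidates(pickWeighted(w))
        }
        chosen
      }
    }

Each round multiplies the candidate set rather than adding a single center,
which is why the initialization needs O(rounds) passes over the data instead of
the O(k) passes that ++ needs.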

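And for the trim cutoff in ball k-means discussed above, the centroid update
step looks roughly like this (again only an illustrative sketch; the actual
Mahout BallKMeans differs in its details):

    object BallKMeansStep {

      private def dist(a: Array[Double], b: Array[Double]): Double = {
        var s = 0.0
        var i = 0
        while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
        math.sqrt(s)
      }

      // One update over the weighted sketch points: each centroid is
      // recomputed only from the points inside its "ball", whose radius is
      // cutoffRatio times the distance to the nearest other centroid
      // (0.5 per the paper; about 1.0 or larger per Ted's experience).
      // Assumes at least two centers.
      def update(points: Seq[(Array[Double], Double)],
                 centers: IndexedSeq[Array[Double]],
                 cutoffRatio: Double): IndexedSeq[Array[Double]] = {
        val dim = centers.head.length
        val radius = centers.indices.map { j =>
          cutoffRatio * centers.indices
            .filter(_ != j)
            .map(i => dist(centers(i), centers(j)))
            .min
        }
        val sums = Array.fill(centers.length, dim)(0.0)
        val wsum = Array.fill(centers.length)(0.0)
        for ((p, w) <- points) {
          val j = centers.indices.minBy(i => dist(p, centers(i)))
          if (dist(p, centers(j)) <= radius(j)) { // trim outlying points
            wsum(j) += w
            for (i <- 0 until dim) sums(j)(i) += w * p(i)
          }
        }
        centers.indices.map { j =>
          if (wsum(j) > 0) Array.tabulate(dim)(i => sums(j)(i) / wsum(j))
          else centers(j)                         // empty ball: keep old centroid
        }
      }
    }

An Elkan-style triangle-inequality bound (reference [2] above) would let the
assignment loop skip most of the distance evaluations, which is the speedup Ted
suggests for the Mahout implementation.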