On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <[email protected]> wrote:
> - Have you considered sketch based algorithms? > can you give me a reference. at this point i am just contemplating more or less shameless port of what they've done in mllib. > > - It can be important to use optimizations in the search for nearest > centroid. Consider triangle optimizations. > > - Do you mean "parallel" when you type || or is there another meaning > there? > No, i mean method called "kmeans||". It's unfortunate name since I really don't know how to make google to search for it. http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf > > - When you say ++ initialization, many people get this wrong and assume > that you mean pick the furthest point. Getting really good initialization > is fairly difficult and typically requires more time than the actual > clustering. This is one of the key benefits of sketch based methods. > > - Most algorithms require multiple restarts. At higher dimension the > number of restarts required becomes very large. An ideal implementation > does parallel sketch extraction followed by parallel ball k-means for > restarts. > > > > On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <[email protected]> > wrote: > > > Considering porting implementation [1] and paper for KMeans || for > > Bindings. > > > > This seems like another method to map fairly nicely. > > > > The problem I am contemplating is ||-initialization, and in particular, > > centroid storage. That particular implementation assumes centroids could > be > > kept in memory in front. > > > > (1) Question is, is it a dangerous idea. It doesn't seem like it > > particularly is, since unlikely people would want more k>1e+6. Another > > thing, centers seem to be passed in via closure attribute (i.e. > > java-serialized array-backed matrix).However, with Bindings it is quite > > possible to keep centers at the back as a matrix. > > > > (2) obviously, LLoyd iterations are not terribly accurate. || and ++ > > versions mostly speed things up. Is there any better-than-LLoyd accuracy > > preference? > > > > > > [1] > > > > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala > > >
