Re: KMeans|| opinions

Dmitriy Lyubimov Thu, 03 Apr 2014 23:15:12 -0700

On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <[email protected]> wrote:


> - Have you considered sketch based algorithms?
>

can you give me a reference. at this point i am just  contemplating more or
less shameless port of what they've done in mllib.


>
> - It can be important to use optimizations in the search for nearest
> centroid.  Consider triangle optimizations.
>
> - Do you mean "parallel" when you type || or is there another meaning
> there?
>

No, i mean method called "kmeans||". It's unfortunate name since I really
don't know how to make google to search for it.

http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

>
> - When you say ++ initialization, many people get this wrong and assume
> that you mean pick the furthest point.  Getting really good initialization
> is fairly difficult and typically requires more time than the actual
> clustering.  This is one of the key benefits of sketch based methods.
>
> - Most algorithms require multiple restarts.  At higher dimension the
> number of restarts required becomes very large.  An ideal implementation
> does parallel sketch extraction followed by parallel ball k-means for
> restarts.
>
>
>
> On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > Considering porting implementation [1] and paper for KMeans || for
> > Bindings.
> >
> > This seems like another method to map fairly nicely.
> >
> > The problem I am contemplating is ||-initialization, and in particular,
> > centroid storage. That particular implementation assumes centroids could
> be
> > kept in memory in front.
> >
> > (1) Question is, is it a dangerous idea. It doesn't seem like it
> > particularly is, since unlikely people would want more k>1e+6. Another
> > thing, centers seem to be passed in via closure attribute (i.e.
> > java-serialized array-backed matrix).However, with Bindings it is quite
> > possible to keep centers at the back as a matrix.
> >
> > (2) obviously, LLoyd iterations are not terribly accurate. || and ++
> > versions mostly speed things up. Is there any better-than-LLoyd accuracy
> > preference?
> >
> >
> > [1]
> >
> >
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
> >
>

Re: KMeans|| opinions

Reply via email to