Of course, this is just for one pass of k-means.  If you need 4 passes, you
break even (4 passes x 10 distances = 40).

More typically for big data problems, k = 1000 or some such.  The number of
distance computations per point for streaming k-means will still be about 40
(or use the more theory-motivated value of log k + log log N = 10 + 5, plus a
bit of fudge for the real world).

For k-means in that case, you still have 1000 distances to compute per point
per pass, and multiple passes to do.  With 10 passes, the ratio becomes
something more like 10,000 / 40 = 250.
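
As a rough sketch of the arithmetic above: the function names are made up
for illustration, the 10 passes are inferred from the 10,000 figure, and
base-2 logs with N = 2^32 are an assumption chosen so that
log k + log log N comes out near 10 + 5.

```python
import math

# Measured figure quoted in this thread for streaming k-means
# (distance computations per point seen, with searchSize = 2).
STREAMING_OBSERVED = 40


def kmeans_distances_per_point(k, passes):
    """Regular k-means compares each point to all k centroids on every pass."""
    return k * passes


def streaming_theory_estimate(k, n):
    """Theory-motivated estimate log k + log log N (base 2 is an assumption)."""
    return math.log2(k) + math.log2(math.log2(n))


# Small case from the thread: 10 clusters, 4 passes.
small_case = kmeans_distances_per_point(k=10, passes=4)

# Big data case: k = 1000, assuming 10 passes.
large_case = kmeans_distances_per_point(k=1000, passes=10)
ratio = large_case / STREAMING_OBSERVED

# N = 2**32 is an assumed data set size, not a figure from the thread.
theory = streaming_theory_estimate(k=1000, n=2**32)

print("small case, k-means distances per point:", small_case)
print("large case, k-means distances per point:", large_case)
print("ratio versus streaming k-means:", ratio)
print("theory estimate before fudge factor:", round(theory, 2))
```

The observed 40 is larger than the theory estimate, which is where the fudge
factor comes in.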



On Fri, Dec 27, 2013 at 12:55 PM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> I updated the repository (with the typo)
>
> g...@github.com:baunz/cluster-comprarison.git
>
> to include more logging information about the number of times the distance
> measure calculation is triggered (which is the most expensive thing imo).
> The number of distance measure calculations per point seen is about 40 for
> streaming k-means and 10 for regular k-means (because there are 10
> clusters).
>
> This of course depends on the searchSize parameter, but I used the
> default value of 2.
>
>
>
> On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm <isa...@apache.org
> >wrote:
>
> >
> > Hi Dan,
> >
> >
> > On Fri, 27 Dec 2013 14:13:51 +0200
> > Dan Filimon <dfili...@apache.org> wrote:
> > > Thoughts?
> >
> > First of all - good to see you back on dev@ :)
> >
> > Seems a few people have run into these issues. As there is currently no
> > high-level documentation for the whole streaming k-means implementation,
> > would you mind writing up the limitations and advice you have for users
> > of this algorithm? Doesn't need to be anything fancy - essentially:
> > here's how you compute how much memory you need to run this, here are the
> > limitations and the flags to deal with them, here are the things that
> > should be changed or fixed in a later iteration - unless your previous
> > mail covers all of this already. This could save people a few debugging
> > cycles when getting started with this at scale.
> >
> > Feel free to get it onto our web page (if you are short on time, just
> > write something up using markdown, I can take over publishing it).
> >
> > Isabel
> >
>
