Okay, understood. For a lot of clusters (which i don't necessarily
attribute to big data problems but definetly to nearest neighbour like
usage of clusters), the every "cluster sees every point" doesnt scale well.

However, for (final) 1000 clusters i see around 200 distance measure
calculations per point, a lot better than 1000. And another thing that
makes me think is that the test runs A LOT faster than with 10 final
clusters.


On Sat, Dec 28, 2013 at 1:09 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Of course, this is just for one pass of k-means.  If you need 4 passes, you
> have break-even.
>
> More typically for big data problems, k=1000 or some such.  Total number of
> distance computations for streaming k-means will still be about 40 (or
> adjust to the more theory motivated value of log k + log log N = 10 + 5 and
> then adjust with a bit of fudge for real world).
>
> For k-means in that case, you still have 1000 distances to compute per pass
> and multiple passes to do.  That ratio then becomes something more like
> 10,000 / 40 = 250.
>
>
>
> On Fri, Dec 27, 2013 at 12:55 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
>
> > I updated the repository (with the typo)
> >
> > g...@github.com:baunz/cluster-comprarison.git
> >
> > to include more logging information about the number of times the
> distance
> > measure calculation is triggered (which is the most expensive thing imo).
> > the factor of dist. measure calculations per point seen is about 40 at
> > streaming k-means and 10 for regular k-means (because there are 10
> > clusters).
> >
> > This is of course dependent on the searchSize Parameter but i used the
> > default value of 2.
> >
> >
> >
> > On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm <isa...@apache.org
> > >wrote:
> >
> > >
> > > Hi Dan,
> > >
> > >
> > > On Fri, 27 Dec 2013 14:13:51 +0200
> > > Dan Filimon <dfili...@apache.org> wrote:
> > > > Thoughts?
> > >
> > > First of all - good to see you back on dev@ :)
> > >
> > > Seems a few people have run into these issues. As currently there is no
> > > high level documentation for the whole streaming kmeans implementation
> > > - would you mind writing up the limitation and advise you have for
> users
> > > of this algorithm? Doesn't need to be anything fancy - essentially a
> > > here's how you compute how much memory you need to run this, here's the
> > > limitations and the flags to deal with these, here's things that should
> > > be changed or fixed in a later iteration - unless your previous mail
> > > covers all of this already. This could safe people a few debugging
> > > cycles when getting started with this at scale.
> > >
> > > Feel free to get it into our web page (if you are short in time, just
> > > write something up using markdown, I can take over publishing it).
> > >
> > > Isabel
> > >
> >
>

Reply via email to