Okay, understood. For a lot of clusters (which I don't necessarily attribute to big data problems, but definitely to nearest-neighbour-like usage of clusters), the "every cluster sees every point" approach doesn't scale well.
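As a rough illustration of that scaling (a hypothetical sketch, not Mahout's actual implementation), a naive Lloyd-style assignment step compares every point against every centroid, so a single pass costs n * k distance computations:

```python
import random

def assign_naive(points, centroids):
    """Brute-force assignment: every point sees every centroid."""
    n_dist = 0
    assignment = []
    for p in points:
        best, best_d = None, float("inf")
        for ci, c in enumerate(centroids):
            d = sum((a - b) ** 2 for a, b in zip(p, c))  # squared Euclidean
            n_dist += 1
            if d < best_d:
                best, best_d = ci, d
        assignment.append(best)
    return assignment, n_dist

points = [(random.random(), random.random()) for _ in range(500)]
centroids = [(random.random(), random.random()) for _ in range(10)]
_, n_dist = assign_naive(points, centroids)
# n_dist is exactly 500 * 10 = 5000 for this single pass
```

With k final clusters and several passes, the per-point cost of plain k-means is k times the number of passes, which is what makes the approximate-search approach in streaming k-means interesting.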
However, for 1000 (final) clusters I see around 200 distance measure calculations per point, a lot better than 1000. And another thing that makes me think is that the test runs A LOT faster than with 10 final clusters.

On Sat, Dec 28, 2013 at 1:09 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Of course, this is just for one pass of k-means. If you need 4 passes, you
> have break-even.
>
> More typically for big data problems, k=1000 or some such. The total number of
> distance computations for streaming k-means will still be about 40 (or
> adjust to the more theory-motivated value of log k + log log N = 10 + 5, and
> then adjust with a bit of fudge for the real world).
>
> For k-means in that case, you still have 1000 distances to compute per pass
> and multiple passes to do. That ratio then becomes something more like
> 10,000 / 40 = 250.
>
> On Fri, Dec 27, 2013 at 12:55 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
> > I updated the repository (with the typo)
> >
> > g...@github.com:baunz/cluster-comprarison.git
> >
> > to include more logging information about the number of times the distance
> > measure calculation is triggered (which is the most expensive thing imo).
> > The factor of distance measure calculations per point seen is about 40 for
> > streaming k-means and 10 for regular k-means (because there are 10
> > clusters).
> >
> > This is of course dependent on the searchSize parameter, but I used the
> > default value of 2.
> >
> > On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm <isa...@apache.org> wrote:
> > > Hi Dan,
> > >
> > > On Fri, 27 Dec 2013 14:13:51 +0200
> > > Dan Filimon <dfili...@apache.org> wrote:
> > > > Thoughts?
> > >
> > > First of all - good to see you back on dev@ :)
> > >
> > > Seems a few people have run into these issues.
> > > As there is currently no
> > > high-level documentation for the whole streaming k-means implementation
> > > - would you mind writing up the limitations and advice you have for users
> > > of this algorithm? It doesn't need to be anything fancy - essentially:
> > > here's how you compute how much memory you need to run this, here are the
> > > limitations and the flags to deal with them, here are things that should
> > > be changed or fixed in a later iteration - unless your previous mail
> > > covers all of this already. This could save people a few debugging
> > > cycles when getting started with this at scale.
> > >
> > > Feel free to get it onto our web page (if you are short on time, just
> > > write something up using markdown, I can take over publishing it).
> > >
> > > Isabel
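The back-of-the-envelope cost comparison from the thread above can be written out as follows (the fudge factor here is an assumption chosen only to illustrate Ted's "~40 per point" figure; it is not a value taken from the Mahout code):

```python
import math

def streaming_per_point(k, n, fudge=2.5):
    """Ted's estimate: roughly log k + log log N distance computations
    per point for streaming k-means, inflated by a real-world fudge factor
    (the fudge value is an illustrative assumption)."""
    return fudge * (math.log2(k) + math.log2(math.log2(n)))

def kmeans_per_point(k, passes):
    """Regular k-means: k distance computations per point, per pass."""
    return k * passes

# k = 1000, N = 1e6: streaming k-means stays in the neighbourhood of 40
# distance computations per point...
streaming = streaming_per_point(1000, 1_000_000)

# ...while 10 passes of regular k-means cost 10,000 per point, giving the
# roughly 250x ratio mentioned in the thread.
ratio = kmeans_per_point(1000, 10) / 40  # 10,000 / 40 = 250
```

Note how weakly the streaming cost depends on k and N: both enter only logarithmically, which is why raising the number of final clusters from 10 to 1000 barely changes the per-point work.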