Of course, this is just for one pass of k-means. If you need 4 passes, you have break-even.
More typically for big data problems, k=1000 or some such. Total number of distance computations for streaming k-means will still be about 40 (or adjust to the more theory motivated value of log k + log log N = 10 + 5 and then adjust with a bit of fudge for real world). For k-means in that case, you still have 1000 distances to compute per pass and multiple passes to do. That ratio then becomes something more like 10,000 / 40 = 250. On Fri, Dec 27, 2013 at 12:55 PM, Johannes Schulte < johannes.schu...@gmail.com> wrote: > I updated the repository (with the typo) > > g...@github.com:baunz/cluster-comprarison.git > > to include more logging information about the number of times the distance > measure calculation is triggered (which is the most expensive thing imo). > the factor of dist. measure calculations per point seen is about 40 at > streaming k-means and 10 for regular k-means (because there are 10 > clusters). > > This is of course dependent on the searchSize Parameter but i used the > default value of 2. > > > > On Fri, Dec 27, 2013 at 6:54 PM, Isabel Drost-Fromm <isa...@apache.org > >wrote: > > > > > Hi Dan, > > > > > > On Fri, 27 Dec 2013 14:13:51 +0200 > > Dan Filimon <dfili...@apache.org> wrote: > > > Thoughts? > > > > First of all - good to see you back on dev@ :) > > > > Seems a few people have run into these issues. As currently there is no > > high level documentation for the whole streaming kmeans implementation > > - would you mind writing up the limitation and advise you have for users > > of this algorithm? Doesn't need to be anything fancy - essentially a > > here's how you compute how much memory you need to run this, here's the > > limitations and the flags to deal with these, here's things that should > > be changed or fixed in a later iteration - unless your previous mail > > covers all of this already. This could safe people a few debugging > > cycles when getting started with this at scale. > > > > Feel free to get it into our web page (if you are short in time, just > > write something up using markdown, I can take over publishing it). > > > > Isabel > > >