To be honest, i always cancelled the sketching after a while because i wasn't satisfied with the points per second speed. The version used is the 0.8 release.
if i find the time i'm gonna look what is called when and where and how often and what the problem could be. On Thu, Dec 26, 2013 at 8:22 AM, Ted Dunning <ted.dunn...@gmail.com> wrote: > Interesting. In Dan's tests on sparse data, he got about 10x speedup net. > > You didn't run multiple sketching passes did you? > > > Also, which version? There was a horrendous clone in there at one time. > > > > > On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte < > johannes.schu...@gmail.com> wrote: > > > everybody should have the right to do > > > > job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G"); > > > > for that :) > > > > > > For my problems, i always felt the sketching took too long. i put up a > > simple comparison here: > > > > g...@github.com:baunz/cluster-comprarison.git > > > > it generates some sample vectors and clusters them with regular k-means, > > and streaming k-means, both sequentially. i took 10 kmeans iterations as > a > > benchmark and used the default values for FastProjectionSearch from the > > kMeans Driver Class. > > > > Visual VM tells me the most time is spent in > FastProjectionSearch.remove(). > > This is called on every added datapoint. > > > > Maybe i got something wrong but for this sparse, high dimensional > vectors i > > never got streaming k-means faster than the regula version > > > > > > > > > > On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi <suneel_mar...@yahoo.com > > >wrote: > > > > > Not sure how that would work in a corporate setting wherein there's a > > > fixed systemwide setting that cannot be overridden. > > > > > > Sent from my iPhone > > > > > > > On Dec 25, 2013, at 9:44 AM, Sebastian Schelter <s...@apache.org> > > > wrote: > > > > > > > >> On 25.12.2013 14:19, Suneel Marthi wrote: > > > >> > > > >> > > > >> > > > >> > > > >> > > > >>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning < > > > ted.dunn...@gmail.com> wrote: > > > >> > > > >>>> For reference, on a 16 core machine, I was able to run the > > sequential > > > >>>> version of streaming k-means on 1,000,000 points, each with 10 > > > dimensions > > > >>>> in about 20 seconds. The map-reduce versions are comparable > subject > > > to > > > >>>> scaling except for startup time. > > > >> > > > >> @Ted, were u working off the Streaming KMeans impl as in Mahout 0.8. > > > Not sure how this would have even worked for u in sequential mode in > > light > > > of the issues reported against M-1314, M-1358, M-1380 (all of which > > impact > > > the sequential mode); unless u had fixed them locally. > > > >> What were ur estimatedDistanceCutoff, number of clusters 'k', > > > projection search and how much memory did u have to allocate to the > > single > > > Reducer? > > > > > > > > If I read the source code correctly, the final reducer clusters the > > > > sketch which should contain m * k * log n intermediate centroids, > where > > > > k is the number of desired clusters, m is the number of mappers run > and > > > > n is the number of datapoints. Those centroids are expected to be > > dense, > > > > so we can estimate the memory required for the final reducer using > this > > > > formula. > > > > > > > >> > > > >> > > > >> > > > >> > > > >>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter < > s...@apache.org> > > > wrote: > > > >>> > > > >>> That the algorithm runs a single reducer is expected. The algorithm > > > >>> creates a sketch of > > > >> the data in parallel in the map-phase, which is > > > >>> collected by the reducer afterwards. The reducer then applies an > > > >>> expensive in-memory clustering algorithm to the sketch. > > > >>> > > > >>> Which dataset are you using for testing? I can also do some tests > on > > a > > > >>> cluster here. > > > >>> > > > >>> I can imagine two possible causes for the problems: Maybe there's a > > > >>> problem with the vectors and some calculations take very long > because > > > >>> the wrong access pattern or implementation is chosen. > > > >>> > > > >>> Another problem could be that the mappers and reducers have too few > > > >>> memory and spend a lot of time running garbage collections. > > > >>> > > > >>> --sebastian > > > >>> > > > >>> > > > >>> On 23.12.2013 22:14, > > > >> Suneel Marthi wrote: > > > >>>> Has anyone be successful running Streaming KMeans clustering on a > > > large > > > >>> dataset (> 100,000 points)? > > > >>>> > > > >>>> > > > >>>> It just seems to take a very long time (> 4hrs) for the mappers to > > > >>> finish on about 300K data points and the reduce phase has only a > > single > > > >>> reducer running and throws an OOM failing the job several hours > after > > > the > > > >>> job has been kicked off. > > > >>>> > > > >>>> Its the same story when trying to run in sequential mode. > > > >>>> > > > >>>> Looking at the code the bottleneck seems to be in > > > >>> StreamingKMeans.clusterInternal(), without understanding the > > behaviour > > > of > > > >>> the algorithm I am not sure if the sequence of steps in there is > > > correct. > > > >>>> > > > >>>> > > > >>>> There are few calls that call themselves repeatedly over and over > > > again > > > >>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst(). > > > >>>> > > > >>>> We really need to have this working on datasets that are larger > than > > > 20K > > > >>> reuters datasets. > > > >>>> > > > >>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and > > > >>> FastProjectSearch. > > > > > > > > > >