Streaming KMeans is now part of examples/cluster-reuters.sh in trunk (for the 0.9 release). While it runs successfully (both the sequential and MapReduce versions), it's clustering only a collection of about 21K small documents (from the Reuters-21578 corpus), and the values I used were k = 10, km = 100, rskm = true, -sc FastProjectionSearch, -dm CosineDistanceMeasure.
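[Editor's note: for readers who want to reproduce a run like this outside the script, here is a minimal sequential sketch, assuming the 0.9 StreamingKMeans/FastProjectionSearch APIs. The searcher parameters (3 projections, search size 2), the distance cutoff, and the toy input are illustrative assumptions, not what cluster-reuters.sh actually passes.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.mahout.clustering.streaming.cluster.StreamingKMeans;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.neighborhood.FastProjectionSearch;
import org.apache.mahout.math.neighborhood.UpdatableSearcher;

public class StreamingKMeansExample {
  public static void main(String[] args) {
    // Generate some toy dense points (1000 points, 50 dimensions).
    Random rand = new Random(42);
    List<Centroid> points = new ArrayList<Centroid>();
    for (int i = 0; i < 1000; i++) {
      double[] values = new double[50];
      for (int j = 0; j < values.length; j++) {
        values[j] = rand.nextGaussian();
      }
      points.add(new Centroid(i, new DenseVector(values), 1));
    }
    // Cosine distance, as with -dm above; projection/search sizes assumed.
    UpdatableSearcher searcher =
        new FastProjectionSearch(new CosineDistanceMeasure(), 3, 2);
    // 100 corresponds to km above: the estimated number of intermediate
    // sketch clusters; 1e-6 is an illustrative initial distance cutoff.
    StreamingKMeans skm = new StreamingKMeans(searcher, 100, 1e-6);
    UpdatableSearcher sketch = skm.cluster(points);
    System.out.println("sketch size: " + sketch.size());
  }
}
```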
It's when you increase the number of documents and the size of each document (adding more dimensions) that you start seeing performance issues, which are:

a) The mappers take a long time to complete, and it's either the searcher.remove() or searcher.searchFirst() calls (will check again in my next attempt) that seem to be the bottleneck.

b) Once the mappers complete (after several hours), the reducer dies with an OOM exception (despite having set -Xmx2G).

BTW, I was using a 300K-document corpus that Amir had provided, available at http://gluegadget.com/split-vectors.tar.bz2, and the values were k = 1000, km = 12610, sc = FastProjectionSearch. Each mapper was taking about 45 minutes to complete (with the aforementioned CLI args), and my reducer failed when it timed out. Amir had reported that the MapReduce version completed successfully after 16 hrs. He was running on a single-node, pseudo-distributed Hadoop and had to set "mapred.reduce.child.java.opts" to "-Xmx10240M" and "mapred.task.timeout" to "360000000" (100 hours!) for this to complete successfully.

If I increased the number of clusters to k = 200,000, then the mappers finish really quickly (about 2-3 mins each), but the reducer again fails with either a timeout or an OOM exception. I did not set my "mapred.task.timeout" to 100 hours (that's definitely not gonna cut it in a corporate environment).

On Thursday, December 26, 2013 2:58 AM, Johannes Schulte <[email protected]> wrote:

To be honest, I always cancelled the sketching after a while because I wasn't satisfied with the points-per-second speed. The version used is the 0.8 release. If I find the time, I'm gonna look at what is called when and where and how often, and what the problem could be.

On Thu, Dec 26, 2013 at 8:22 AM, Ted Dunning <[email protected]> wrote:

> Interesting. In Dan's tests on sparse data, he got about a 10x net speedup.
>
> You didn't run multiple sketching passes, did you?
>
> Also, which version? There was a horrendous clone in there at one time.
>
> On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte <[email protected]> wrote:
>
> > Everybody should have the right to do
> >
> > job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");
> >
> > for that :)
> >
> > For my problems, I always felt the sketching took too long. I put up a simple comparison here:
> >
> > [email protected]:baunz/cluster-comprarison.git
> >
> > It generates some sample vectors and clusters them with regular k-means and with streaming k-means, both sequentially. I took 10 k-means iterations as a benchmark and used the default values for FastProjectionSearch from the KMeans driver class.
> >
> > Visual VM tells me the most time is spent in FastProjectionSearch.remove(). This is called on every added datapoint.
> >
> > Maybe I got something wrong, but for these sparse, high-dimensional vectors I never got streaming k-means faster than the regular version.
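[Editor's note: a minimal sketch of the per-job override Johannes alludes to, using the Hadoop 1.x Job API and the exact property names reported in this thread; the heap and timeout values mirror Amir's settings and are illustrative.]

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobOverrideSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "streaming-kmeans");
    // Give the single reducer a 10 GB heap, as Amir did.
    job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx10240M");
    // 360,000,000 ms = 100 hours before the TaskTracker kills the task.
    job.getConfiguration().set("mapred.task.timeout", "360000000");
    // ... set mapper/reducer classes and input/output paths, then submit.
  }
}
```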
> > On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi <[email protected]> wrote:
> >
> > > Not sure how that would work in a corporate setting wherein there's a fixed system-wide setting that cannot be overridden.
> > >
> > > Sent from my iPhone
> > >
> > > > On Dec 25, 2013, at 9:44 AM, Sebastian Schelter <[email protected]> wrote:
> > > >
> > > >> On 25.12.2013 14:19, Suneel Marthi wrote:
> > > >>
> > > >>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <[email protected]> wrote:
> > > >>>>
> > > >>>> For reference, on a 16-core machine, I was able to run the sequential version of streaming k-means on 1,000,000 points, each with 10 dimensions, in about 20 seconds. The map-reduce versions are comparable, subject to scaling, except for startup time.
> > > >>
> > > >> @Ted, were you working off the Streaming KMeans impl as in Mahout 0.8? Not sure how this would have even worked for you in sequential mode in light of the issues reported against M-1314, M-1358, M-1380 (all of which impact the sequential mode), unless you had fixed them locally.
> > > >> What were your estimatedDistanceCutoff, number of clusters 'k', and projection search, and how much memory did you have to allocate to the single reducer?
> > > >
> > > > If I read the source code correctly, the final reducer clusters the sketch, which should contain m * k * log n intermediate centroids, where k is the number of desired clusters, m is the number of mappers run, and n is the number of datapoints. Those centroids are expected to be dense, so we can estimate the memory required for the final reducer using this formula.
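[Editor's note: plugging the thread's numbers into Sebastian's m * k * log n estimate makes the OOM unsurprising. A back-of-the-envelope sketch; the mapper count (m = 10) and the dense centroid dimensionality (d = 10,000) are assumptions for illustration.]

```java
public class SketchMemoryEstimate {
  public static void main(String[] args) {
    int m = 10;      // number of mappers (assumed)
    int k = 1000;    // desired clusters, as in the 300K-document run
    int n = 300000;  // number of datapoints
    int d = 10000;   // dimensionality of the dense centroids (assumed)
    // m * k * log2(n) intermediate centroids, per Sebastian's estimate.
    double centroids = m * k * (Math.log(n) / Math.log(2));
    // 8 bytes per double for a dense vector of dimension d.
    double gigabytes = centroids * d * 8 / (1024.0 * 1024 * 1024);
    System.out.printf("~%.0f intermediate centroids, ~%.1f GB of doubles%n",
        centroids, gigabytes);
    // Prints roughly 182,000 centroids and ~13.6 GB under these assumptions,
    // which is consistent with a 2 GB reducer heap having no chance.
  }
}
```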
> > >>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <[email protected]> wrote:
> > >>>
> > >>> That the algorithm runs a single reducer is expected. The algorithm creates a sketch of the data in parallel in the map phase, which is collected by the reducer afterwards. The reducer then applies an expensive in-memory clustering algorithm to the sketch.
> > >>>
> > >>> Which dataset are you using for testing? I can also do some tests on a cluster here.
> > >>>
> > >>> I can imagine two possible causes for the problems: Maybe there's a problem with the vectors, and some calculations take very long because the wrong access pattern or implementation is chosen.
> > >>>
> > >>> Another problem could be that the mappers and reducers have too little memory and spend a lot of time running garbage collections.
> > >>>
> > >>> --sebastian
> > >>>
> > >>> On 23.12.2013 22:14, Suneel Marthi wrote:
> > >>>> Has anyone been successful running Streaming KMeans clustering on a large dataset (> 100,000 points)?
> > >>>>
> > >>>> It just seems to take a very long time (> 4 hrs) for the mappers to finish on about 300K data points, and the reduce phase has only a single reducer running and throws an OOM, failing the job several hours after the job has been kicked off.
> > >>>>
> > >>>> It's the same story when trying to run in sequential mode.
> > >>>>
> > >>>> Looking at the code, the bottleneck seems to be in StreamingKMeans.clusterInternal(); without understanding the behaviour of the algorithm, I am not sure if the sequence of steps in there is correct.
> > >>>>
> > >>>> There are a few calls that call themselves repeatedly over and over again, like StreamingKMeans.clusterInternal() and Searcher.searchFirst().
> > >>>>
> > >>>> We really need to have this working on datasets that are larger than the 20K Reuters dataset.
> > >>>>
> > >>>> I am trying to run this on 300K vectors with k = 100, km = 1261 and FastProjectionSearch.
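[Editor's note: Johannes's VisualVM observation and Suneel's reading of the code point at the same neighborhood-search calls. A minimal, hypothetical micro-benchmark of those two calls, assuming the 0.8/0.9 org.apache.mahout.math.neighborhood API (searchFirst(query, differentThanQuery) and remove(vector, epsilon)); the collection size, dimensionality, and searcher parameters are illustrative.]

```java
import java.util.Random;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.neighborhood.FastProjectionSearch;

public class SearcherHotspotBench {
  public static void main(String[] args) {
    Random rand = new Random(42);
    FastProjectionSearch searcher =
        new FastProjectionSearch(new CosineDistanceMeasure(), 3, 2);
    int n = 20000;
    int d = 1000;
    // Fill the searcher with random dense points, keeping references
    // so we can query and remove the exact same vectors.
    Vector[] points = new Vector[n];
    for (int i = 0; i < n; i++) {
      double[] values = new double[d];
      for (int j = 0; j < d; j++) {
        values[j] = rand.nextGaussian();
      }
      points[i] = new DenseVector(values);
      searcher.add(points[i]);
    }
    // Time the two calls the thread identifies as hotspots.
    long start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      searcher.searchFirst(points[i], false);
    }
    System.out.printf("searchFirst: %.1f us/call%n",
        (System.nanoTime() - start) / (n * 1000.0));
    start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      searcher.remove(points[i], 1e-6);
    }
    System.out.printf("remove:      %.1f us/call%n",
        (System.nanoTime() - start) / (n * 1000.0));
  }
}
```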
