Hey Sebastian, it was a text-like clustering problem with a dimensionality of 100,000; the number of data points could have been a million, but I always cancelled the run after a while (I used the Java classes, not the command-line version, and monitored the progress).
As for my statements above: they are possibly not quite correct. Sure, the projection search reduces the amount of searching needed, but when I looked into the code I identified two problems, if I remember correctly:

- the searching of pending additions
- the projection itself

But I'll have to retry that and look into the code again. I ended up using the old k-means code on a sample of the data.

cheers,

johannes

On Wed, Dec 25, 2013 at 11:17 AM, Sebastian Schelter <s...@apache.org> wrote:

> Hi Johannes,
>
> can you share some details about the dataset that you ran streaming
> k-means on (number of datapoints, cardinality, etc.)?
>
> @Ted/Suneel Shouldn't the approximate searching techniques (e.g.
> projection search) help cope with high-dimensional inputs?
>
> --sebastian
>
>
> On 25.12.2013 10:42, Johannes Schulte wrote:
> > Hi,
> >
> > I also had problems getting up to speed, but I blamed the cardinality
> > of the vectors for that. I didn't do the math exactly, but while
> > streaming k-means improves over regular k-means by using log(k) and
> > (number of datapoints / k) passes, the dimension parameter d from the
> > original k*d*n stays untouched, right?
> >
> > What is your vectors' cardinality?
> >
> >
> > On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> >
> >> Ted,
> >>
> >> What were the CLI parameters when you ran this test for 1M points - no.
> >> of clusters k, km, distanceMeasure, projectionSearch,
> >> estimatedDistanceCutoff?
> >>
> >>
> >> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >> For reference, on a 16-core machine, I was able to run the sequential
> >> version of streaming k-means on 1,000,000 points, each with 10
> >> dimensions, in about 20 seconds. The map-reduce versions are
> >> comparable, subject to scaling, except for startup time.
> >>
> >>
> >> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <s...@apache.org> wrote:
> >>
> >>> That the algorithm runs a single reducer is expected. The algorithm
> >>> creates a sketch of the data in parallel in the map phase, which is
> >>> collected by the reducer afterwards. The reducer then applies an
> >>> expensive in-memory clustering algorithm to the sketch.
> >>>
> >>> Which dataset are you using for testing? I can also do some tests on
> >>> a cluster here.
> >>>
> >>> I can imagine two possible causes for the problems: Maybe there's a
> >>> problem with the vectors and some calculations take very long because
> >>> the wrong access pattern or implementation is chosen.
> >>>
> >>> Another problem could be that the mappers and reducers have too
> >>> little memory and spend a lot of time running garbage collections.
> >>>
> >>> --sebastian
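To put rough numbers on the k*d*n point Johannes makes above, here is a cost comparison using illustrative symbols of my own (i = number of Lloyd iterations, m = sketch size, s = candidates the approximate searcher actually probes per point; these are not Mahout parameter names):

$$
T_{\text{lloyd}} \approx i \cdot k \cdot n \cdot d,
\qquad
T_{\text{streaming}} \approx s \cdot n \cdot d + T_{\text{final}}(m, k, d),
\qquad s \ll k.
$$

Approximate search shrinks the s factor, and for sparse text vectors d is effectively the number of non-zeros per vector, but every distance evaluation still carries that d factor, so a cardinality of 100,000 dominates the constant either way.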
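For readers following along, below is a minimal, self-contained sketch of the one-pass sketch-building phase Sebastian describes. It illustrates the general streaming k-means technique only; the class name, the merge probability, and the cutoff-growing policy are simplifications of mine, not Mahout's actual StreamingKMeans internals. (The km = 1261 reported later in this thread matches k * log(n) for k = 100 and n = 300K, which is the role maxSketchSize plays here.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Toy illustration of the streaming k-means sketch phase: one pass over
 * the data maintains a weighted "sketch" of roughly k * log(n) centroids,
 * which a later in-memory pass (e.g. ball k-means) reduces to k clusters.
 */
public class StreamingSketch {
  private final List<double[]> centroids = new ArrayList<>();
  private final List<Double> weights = new ArrayList<>();
  private final int maxSketchSize;   // roughly k * log(n)
  private double distanceCutoff;     // distances below this tend to merge
  private final Random rnd = new Random(42);

  public StreamingSketch(int maxSketchSize, double initialCutoff) {
    this.maxSketchSize = maxSketchSize;
    this.distanceCutoff = initialCutoff;
  }

  public void add(double[] point) {
    if (centroids.isEmpty()) {
      centroids.add(point.clone());
      weights.add(1.0);
      return;
    }
    // Brute-force nearest-centroid search; Mahout swaps in an approximate
    // searcher (e.g. projection search) here. Note that each distance
    // evaluation still costs O(d).
    int nearest = 0;
    double best = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centroids.size(); i++) {
      double dist = squaredDistance(point, centroids.get(i));
      if (dist < best) { best = dist; nearest = i; }
    }
    // Merge with probability decreasing in the distance; otherwise open
    // a new sketch centroid.
    if (rnd.nextDouble() < distanceCutoff / (distanceCutoff + best)) {
      mergeInto(nearest, point);
    } else {
      centroids.add(point.clone());
      weights.add(1.0);
    }
    // If the sketch grows too large, loosen the cutoff so future points
    // merge more aggressively. (Mahout additionally re-clusters the
    // sketch at this point; omitted for brevity.)
    if (centroids.size() > maxSketchSize) {
      distanceCutoff *= 1.5;
    }
  }

  private void mergeInto(int i, double[] point) {
    double[] c = centroids.get(i);
    double w = weights.get(i);
    for (int j = 0; j < c.length; j++) {
      c[j] = (c[j] * w + point[j]) / (w + 1.0);
    }
    weights.set(i, w + 1.0);
  }

  private static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
```

In the map-reduce version, each mapper runs this loop over its partition and emits its weighted centroids; the single reducer concatenates the sketches and clusters the few thousand weighted points in memory, which is why a single reducer is expected.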
> >>>
> >>> On 23.12.2013 22:14, Suneel Marthi wrote:
> >>>> Has anyone been successful running Streaming KMeans clustering on a
> >>>> large dataset (> 100,000 points)?
> >>>>
> >>>> It just seems to take a very long time (> 4 hrs) for the mappers to
> >>>> finish on about 300K data points, and the reduce phase has only a
> >>>> single reducer running, which throws an OOM and fails the job
> >>>> several hours after it has been kicked off.
> >>>>
> >>>> It's the same story when trying to run in sequential mode.
> >>>>
> >>>> Looking at the code, the bottleneck seems to be in
> >>>> StreamingKMeans.clusterInternal(); without understanding the
> >>>> behaviour of the algorithm, I am not sure if the sequence of steps
> >>>> in there is correct.
> >>>>
> >>>> There are a few calls that are invoked repeatedly, over and over
> >>>> again, like StreamingKMeans.clusterInternal() and
> >>>> Searcher.searchFirst().
> >>>>
> >>>> We really need to have this working on datasets that are larger
> >>>> than the 20K Reuters dataset.
> >>>>
> >>>> I am trying to run this on 300K vectors with k = 100, km = 1261 and
> >>>> FastProjectSearch.
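Since FastProjectSearch and Searcher.searchFirst() come up repeatedly in this thread, here is a minimal sketch of the projection-search idea: project every point onto a random direction, keep the points ordered by that scalar, and compute full distances only for a small window of candidates around the query's projection. This is a toy illustration of the general technique, not Mahout's implementation; Mahout's searcher uses several projection vectors and different bookkeeping, and the searchFirst name below merely echoes the interface method mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/**
 * Toy projection search: order points by their dot product with one
 * random direction, then answer nearest-neighbour queries by scanning
 * only a small window around the query's position in that order.
 */
public class ProjectionSearch {
  private final double[] direction;                     // random projection vector
  private final List<double[]> points = new ArrayList<>();
  private final List<Double> keys = new ArrayList<>();  // projections, kept sorted
  private final int searchSize;                         // candidates per query

  public ProjectionSearch(int dim, int searchSize, long seed) {
    Random rnd = new Random(seed);
    this.direction = new double[dim];
    for (int i = 0; i < dim; i++) {
      direction[i] = rnd.nextGaussian();
    }
    this.searchSize = searchSize;
  }

  public void add(double[] p) {
    double key = dot(p, direction);                     // O(d) per insertion
    int pos = Collections.binarySearch(keys, key);
    if (pos < 0) pos = -pos - 1;
    keys.add(pos, key);
    points.add(pos, p);
  }

  /** Returns the best candidate found in the window (approximate). */
  public double[] searchFirst(double[] query) {
    double key = dot(query, direction);                 // O(d) again
    int pos = Collections.binarySearch(keys, key);
    if (pos < 0) pos = -pos - 1;
    int from = Math.max(0, pos - searchSize / 2);
    int to = Math.min(points.size(), from + searchSize);
    double[] best = null;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int i = from; i < to; i++) {                   // searchSize full O(d) distances
      double dist = squaredDistance(query, points.get(i));
      if (dist < bestDist) { bestDist = dist; best = points.get(i); }
    }
    return best;
  }

  private static double dot(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  private static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
```

The point of the structure is that a query touches only searchSize full distance computations instead of one per centroid; but the projection dot product and each candidate distance are still O(d) (or O(non-zeros) for sparse vectors), which is consistent with Johannes's suspicion at the top of the thread that the projection itself can become the bottleneck at a cardinality of 100,000.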