Hi Johannes, can you share some details about the dataset that you ran streaming k-means on (number of datapoints, cardinality, etc.)?
@Ted/Suneel Shouldn't the approximate searching techniques (e.g. projection search) help cope with high-dimensional inputs?

--sebastian

On 25.12.2013 10:42, Johannes Schulte wrote:
> Hi,
>
> I also had problems getting up to speed, but I attributed that to the
> cardinality of the vectors. I didn't do the math exactly, but while
> streaming k-means improves over regular k-means in using log(k) and
> (number of datapoints / k) passes, the dimension parameter d from the
> original k*d*n stays untouched, right?
>
> What is your vectors' cardinality?
>
>
> On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <[email protected]> wrote:
>
>> Ted,
>>
>> What were the CLI parameters when you ran this test for 1M points - no. of
>> clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?
>>
>>
>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <[email protected]>
>> wrote:
>>
>> For reference, on a 16-core machine, I was able to run the sequential
>> version of streaming k-means on 1,000,000 points, each with 10 dimensions,
>> in about 20 seconds. The map-reduce versions are comparable subject to
>> scaling, except for startup time.
>>
>>
>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <[email protected]>
>> wrote:
>>
>>> That the algorithm runs a single reducer is expected. The algorithm
>>> creates a sketch of the data in parallel in the map phase, which is
>>> collected by the reducer afterwards. The reducer then applies an
>>> expensive in-memory clustering algorithm to the sketch.
>>>
>>> Which dataset are you using for testing? I can also do some tests on a
>>> cluster here.
>>>
>>> I can imagine two possible causes for the problems: Maybe there's a
>>> problem with the vectors and some calculations take very long because
>>> the wrong access pattern or implementation is chosen.
>>>
>>> Another problem could be that the mappers and reducers have too little
>>> memory and spend a lot of time running garbage collections.
>>>
>>> --sebastian
>>>
>>>
>>> On 23.12.2013 22:14, Suneel Marthi wrote:
>>>> Has anyone been successful running Streaming KMeans clustering on a
>>>> large dataset (> 100,000 points)?
>>>>
>>>> It just seems to take a very long time (> 4hrs) for the mappers to
>>>> finish on about 300K data points, and the reduce phase has only a single
>>>> reducer running, which throws an OOM and fails the job several hours
>>>> after it has been kicked off.
>>>>
>>>> It's the same story when trying to run in sequential mode.
>>>>
>>>> Looking at the code, the bottleneck seems to be in
>>>> StreamingKMeans.clusterInternal(); without understanding the behaviour
>>>> of the algorithm I am not sure if the sequence of steps in there is
>>>> correct.
>>>>
>>>> There are a few methods that are called repeatedly over and over again,
>>>> like StreamingKMeans.clusterInternal() and Searcher.searchFirst().
>>>>
>>>> We really need to have this working on datasets that are larger than the
>>>> 20K Reuters dataset.
>>>>
>>>> I am trying to run this on 300K vectors with k = 100, km = 1261 and
>>>> FastProjectSearch.
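
To make the cost discussion above concrete, here is a simplified, brute-force
sketch of the map-side pass as I understand it. This is not the actual Mahout
code; class names, the cutoff-scaling constant and the merge rule are made up
for illustration only.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /**
     * Simplified, brute-force sketch of the streaming pass the mappers do:
     * one pass over the data, keeping at most ~km weighted centroids.
     * Every incoming point costs one nearest-centroid search, which is
     * where the dimension d enters the runtime.
     */
    public class StreamingSketch {

      static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
          double diff = a[i] - b[i];
          sum += diff * diff;
        }
        return sum;
      }

      public static List<double[]> sketch(Iterable<double[]> points,
                                          int maxSketchSize,     // km
                                          double distanceCutoff, // estimatedDistanceCutoff
                                          Random rng) {
        List<double[]> centroids = new ArrayList<double[]>();
        List<Double> weights = new ArrayList<Double>();

        for (double[] point : points) {
          if (centroids.isEmpty()) {
            centroids.add(point.clone());
            weights.add(1.0);
            continue;
          }

          // Nearest-centroid search: O(|centroids| * d) when done brute force.
          // This is the step an approximate searcher (projection search, LSH)
          // is meant to speed up.
          int nearest = 0;
          double best = Double.POSITIVE_INFINITY;
          for (int i = 0; i < centroids.size(); i++) {
            double dist = squaredDistance(point, centroids.get(i));
            if (dist < best) {
              best = dist;
              nearest = i;
            }
          }

          // Far-away points become new centroids with probability proportional
          // to their distance; nearby points are folded into the closest one.
          if (rng.nextDouble() < Math.min(best / distanceCutoff, 1.0)) {
            centroids.add(point.clone());
            weights.add(1.0);
          } else {
            double[] c = centroids.get(nearest);
            double w = weights.get(nearest);
            for (int i = 0; i < c.length; i++) {
              c[i] = (c[i] * w + point[i]) / (w + 1);
            }
            weights.set(nearest, w + 1);
          }

          // If the sketch grows past its budget, relax the cutoff so future
          // points are merged more aggressively. (The real implementation also
          // re-collapses the existing centroids here, which is the repeated
          // clusterInternal() work showing up in the profiles.)
          if (centroids.size() > maxSketchSize) {
            distanceCutoff *= 1.5;
          }
        }
        return centroids;
      }
    }

The inner nearest-centroid loop is roughly what Searcher.searchFirst() stands
in for in the real implementation, and swapping the brute-force scan for a
projection search is exactly where high dimensionality should become easier
to cope with - although each individual distance computation still touches
all d components, so d never disappears from the cost entirely.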

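Independent of the algorithmic cost, if the single reducer is OOMing or
thrashing in GC while it clusters the collected sketch, it may be worth
raising the per-task heap before resubmitting the job. A minimal,
hypothetical snippet (the property names below are the Hadoop 1.x ones;
on Hadoop 2 they are mapreduce.map.java.opts / mapreduce.reduce.java.opts):

    import org.apache.hadoop.conf.Configuration;

    public class JobHeapSettings {
      public static void main(String[] args) {
        // Give the map and reduce tasks more heap before driving the
        // streaming k-means job with this configuration.
        Configuration conf = new Configuration();
        conf.set("mapred.map.child.java.opts", "-Xmx2048m");
        conf.set("mapred.reduce.child.java.opts", "-Xmx4096m");
        // ... pass conf to the job submission.
      }
    }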