That the algorithm runs a single reducer is expected. The algorithm
creates a sketch of the data in parallel in the map-phase, which is
collected by the reducer afterwards. The reducer then applies an
expensive in-memory clustering algorithm to the sketch.
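For intuition, the two phases can be sketched roughly like this. This is a simplified toy version in Python, not Mahout's actual Java implementation; the function names, the probabilistic "open a new centroid" rule, and the doubling schedule for the distance threshold are my own simplifications of the streaming-sketch idea:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _absorb(sketch, p, w, threshold, rng):
    """Merge (p, w) into its nearest sketch centroid, or open a new
    centroid with probability growing with the squared distance."""
    if not sketch:
        sketch.append((list(p), w))
        return
    j = min(range(len(sketch)), key=lambda k: dist2(p, sketch[k][0]))
    d2 = dist2(p, sketch[j][0])
    if rng.random() < min(1.0, d2 * w / threshold):
        sketch.append((list(p), w))
    else:
        c, cw = sketch[j]
        nw = cw + w
        sketch[j] = ([(ci * cw + pi * w) / nw for ci, pi in zip(c, p)], nw)

def streaming_sketch(points, max_size, threshold=1e-3, seed=0):
    """One pass over the data that compresses it into roughly
    max_size weighted centroids (the 'sketch' built in the map-phase)."""
    rng = random.Random(seed)
    sketch = []
    for p in points:
        _absorb(sketch, p, 1.0, threshold, rng)
        while len(sketch) > max_size:
            threshold *= 2.0            # loosen the distance cutoff ...
            old, sketch = sketch, []
            for c, w in old:            # ... and re-compress the sketch
                _absorb(sketch, c, w, threshold, rng)
    return sketch

# toy usage: 1000 points near two centers collapse into a small sketch
pts = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(500)] + \
      [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(500)]
sketch = streaming_sketch(pts, max_size=20)
print(len(sketch), "weighted centroids, total weight",
      sum(w for _, w in sketch))
```

The point is that the mappers only do this cheap compression; the expensive clustering (a weighted ball k-means in Mahout's case) runs once, in memory, on the few hundred sketch centroids in the single reducer.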

Which dataset are you using for testing? I can also do some tests on a
cluster here.

I can imagine two possible causes for the problems. One is that there's
a problem with the vectors: some calculations can take a very long time
if the wrong access pattern or vector implementation is chosen (e.g.
iterating densely over a sparse vector).

Another problem could be that the mappers and reducers have too little
memory and spend a lot of time running garbage collections.
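If it's the latter, it might be worth raising the child JVM heap for the job. A sketch, assuming the `mahout` CLI front-end and the Hadoop 1.x property name (adjust the value to what your task slots can actually accommodate; the input/output paths here are placeholders):

```shell
# Give each map/reduce child JVM a 2 GB heap (Hadoop 1.x style property)
mahout streamingkmeans \
    -Dmapred.child.java.opts=-Xmx2048m \
    -i /path/to/input-vectors \
    -o /path/to/output \
    -k 100 -km 1261
```

Watching the GC time counters in the job's task logs would tell you quickly whether this is the actual bottleneck.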

--sebastian


On 23.12.2013 22:14, Suneel Marthi wrote:
> Has anyone been successful running Streaming KMeans clustering on a large 
> dataset (> 100,000 points)?
> 
> 
> It just seems to take a very long time (> 4hrs) for the mappers to finish on 
> about 300K data points and the reduce phase has only a single reducer running 
> and throws an OOM failing the job several hours after the job has been kicked 
> off.
> 
> It's the same story when trying to run in sequential mode.
> 
> Looking at the code, the bottleneck seems to be in 
> StreamingKMeans.clusterInternal(); without understanding the behaviour of the 
> algorithm I am not sure if the sequence of steps in there is correct. 
> 
> 
> There are a few methods that are called repeatedly, over and over again, like 
> StreamingKMeans.clusterInternal() and Searcher.searchFirst().
> 
> We really need to have this working on datasets that are larger than the 20K 
> Reuters dataset.
> 
> I am trying to run this on 300K vectors with k = 100, km = 1261 and 
> FastProjectSearch.
> 
