Has anyone been successful running Streaming KMeans clustering on a large dataset
(> 100,000 points)?


It just seems to take a very long time (> 4 hrs) for the mappers to finish on
about 300K data points, and the reduce phase runs only a single reducer, which
throws an OOM and fails the job several hours after it was kicked off.

It's the same story when trying to run in sequential mode.

Looking at the code, the bottleneck seems to be in
StreamingKMeans.clusterInternal(); without understanding the behaviour of the
algorithm, I am not sure whether the sequence of steps in there is correct.


A few methods are invoked repeatedly, over and over, such as
StreamingKMeans.clusterInternal() and Searcher.searchFirst().
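For anyone not familiar with the algorithm, here is a minimal, self-contained
Java sketch of the one-pass streaming k-means idea as I understand it. This is
my own illustration, not Mahout's actual code, and all class and method names
below are made up. The re-clustering loop at the bottom is, as far as I can
tell, the behaviour that surfaces in Mahout as clusterInternal() calling
itself repeatedly:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Hypothetical illustration of one-pass streaming k-means; NOT Mahout's code.
    public class StreamingKMeansSketch {
      private static final Random RAND = new Random(42);

      public static void main(String[] args) {
        int km = 1261;              // maximum sketch size (estimated map clusters)
        double cutoff = 1e-4;       // initial "facility cost" / distance cutoff
        List<double[]> centroids = new ArrayList<>();

        // Simulate a stream of 300K random 2-d points.
        for (int i = 0; i < 300_000; i++) {
          double[] p = {RAND.nextGaussian(), RAND.nextGaussian()};
          cutoff = processPoint(p, centroids, km, cutoff);
        }
        System.out.println("sketch size = " + centroids.size());
      }

      // One streaming step: open a new centroid with probability proportional
      // to the squared distance to the nearest existing centroid; otherwise the
      // point is absorbed (a real implementation would fold it into the nearest
      // centroid's weighted mean, omitted here to keep the sketch short).
      static double processPoint(double[] p, List<double[]> centroids,
                                 int km, double cutoff) {
        double d = nearestSquaredDistance(p, centroids);
        if (centroids.isEmpty() || RAND.nextDouble() < d / cutoff) {
          centroids.add(p.clone());
        }
        // When the sketch outgrows km, raise the cutoff and re-cluster the
        // centroids against themselves. This re-clustering can itself overflow
        // and recurse, which I believe is the repeated self-invocation that
        // shows up in Mahout as clusterInternal() calling itself.
        while (centroids.size() > km) {
          cutoff *= 1.5;            // make opening new centroids harder
          List<double[]> old = new ArrayList<>(centroids);
          centroids.clear();
          for (double[] c : old) {
            cutoff = processPoint(c, centroids, km, cutoff);
          }
        }
        return cutoff;
      }

      static double nearestSquaredDistance(double[] p, List<double[]> centroids) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] c : centroids) {
          double dx = p[0] - c[0], dy = p[1] - c[1];
          best = Math.min(best, dx * dx + dy * dy);
        }
        return best;
      }
    }

The nearest-centroid lookup in this sketch plays the role that
Searcher.searchFirst() plays in Mahout, which would explain why that call
dominates alongside clusterInternal().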

We really need to get this working on datasets much larger than the
~20K-document Reuters dataset.

I am trying to run this on 300K vectors with k = 100, km = 1261, and
FastProjectionSearch.
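(For what it's worth, that km matches the k * ln(n) rule of thumb for the
sketch size: 100 * ln(300000) ≈ 100 * 12.61 ≈ 1261.)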
