Has anyone been successful running streaming k-means clustering on a large dataset (> 100,000 points)?
It just takes a very long time (> 4 hrs) for the mappers to finish on about 300K data points, and the reduce phase has only a single reducer running, which throws an OOM and fails the job several hours after it was kicked off. It's the same story when trying to run in sequential mode.

Looking at the code, the bottleneck seems to be in StreamingKMeans.clusterInternal(). Without understanding the behaviour of the algorithm, I am not sure whether the sequence of steps in there is correct. A few calls, such as StreamingKMeans.clusterInternal() and Searcher.searchFirst(), get invoked repeatedly over and over again.

We really need this working on datasets larger than the 20K Reuters dataset. I am trying to run this on 300K vectors with k = 100, km = 1261, and FastProjectSearch.
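For anyone following along, my rough mental model of what clusterInternal() is doing is the streaming k-means idea from the Shindler/Wong/Meyerson line of work: keep at most km weighted centroids, merge each incoming point into its nearest centroid if it is close enough, otherwise open a new centroid, and when the centroid count overflows, grow the distance cutoff and re-cluster the centroids themselves (which is where the recursion and the repeated searchFirst() calls come from). The Python below is purely my simplified sketch of that idea, not Mahout's actual code; all names and the deterministic merge rule are my own assumptions:

```python
import math

def dist(a, b):
    # Euclidean distance between two tuples
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merge(centroid, p, w):
    # Weighted average of an existing (vector, weight) centroid and a new point
    vec, cw = centroid
    nw = cw + w
    return (tuple((v * cw + x * w) / nw for v, x in zip(vec, p)), nw)

def streaming_cluster(weighted_points, max_clusters, cutoff, beta=1.3):
    """One streaming pass over (point, weight) pairs, keeping <= max_clusters
    weighted centroids. This is a simplification: Mahout decides merge-vs-new
    probabilistically, here it is deterministic on the cutoff."""
    centroids = []
    for p, w in weighted_points:
        if not centroids:
            centroids.append((p, w))
            continue
        # nearest-centroid lookup (the analogue of Searcher.searchFirst())
        d, i = min((dist(p, c[0]), idx) for idx, c in enumerate(centroids))
        if d < cutoff:
            centroids[i] = merge(centroids[i], p, w)
        else:
            centroids.append((p, w))
            if len(centroids) > max_clusters:
                # Overflow: widen the cutoff and re-cluster the centroids
                # themselves -- the clusterInternal()-style recursion.
                cutoff *= beta
                centroids = streaming_cluster(centroids, max_clusters,
                                              cutoff, beta)
    return centroids
```

If this picture is right, the cost per point is a nearest-centroid search, so a slow or repeatedly re-built searcher would explain the mapper times I am seeing, and a single reducer re-clustering all mapper sketches would explain the OOM.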