There is no combiner in the present implementation. Moreover, the codepath that is executed when the 'reduceStreamingKMeans' (-rskm) flag is set does not have adequate test coverage and needs to be tested more extensively. Most of the issues I had been seeing were due to specifying the -rskm flag.
Amir had provided a dataset with about 300K points; could someone try running Streaming KMeans on this, both the MapReduce and sequential versions? I have had no luck with either version. Here is the link to the dataset: http://gluegadget.com/split-vectors.tar.bz2

On Thursday, December 26, 2013 3:02 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Thu, Dec 26, 2013 at 10:19 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>
>> I heard people outside of dev@ and user@ who have tried running
>> Streaming KMeans (from 0.8) on their Production clusters on large
>> datasets and had seen the job crash in the Reduce phase due to OOM
>> errors (this is with -Xmx2GB).
>
> Excessive memory usage in reduce was a known bug that was addressed
> (supposedly) by using a combiner. This really smells like bug
> resurrection happened somehow. Clearly that also means that our unit
> tests are insufficient.
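For anyone following along, the reason the missing combiner matters for the OOM: a combiner pre-aggregates each map task's output before the shuffle, so the reducer receives a bounded number of weighted centroids instead of every raw point. Below is a minimal plain-Java sketch of that idea (no Hadoop dependencies; all class and method names here are illustrative, not Mahout's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Mahout code): each "mapper" collapses its points
// into at most k weighted centroids, so a "reducer" buffers
// O(numMappers * k) items instead of O(numPoints).
public class CombinerSketch {

    // A weighted centroid: a 1-d mean plus the number of points it summarizes.
    static final class Centroid {
        double mean;
        long weight;

        Centroid(double mean, long weight) {
            this.mean = mean;
            this.weight = weight;
        }

        // Fold one weighted value into the running mean.
        void merge(double value, long w) {
            weight += w;
            mean += (value - mean) * w / weight;
        }
    }

    // Combiner-style pre-aggregation: greedily assign each point to its
    // nearest centroid, opening a new centroid only while fewer than k exist
    // and the point is farther than a fixed threshold from all of them.
    static List<Centroid> combine(double[] points, int k) {
        List<Centroid> sketch = new ArrayList<>();
        for (double p : points) {
            Centroid nearest = null;
            double best = Double.MAX_VALUE;
            for (Centroid c : sketch) {
                double d = Math.abs(c.mean - p);
                if (d < best) {
                    best = d;
                    nearest = c;
                }
            }
            if (sketch.size() < k && (nearest == null || best > 1.0)) {
                sketch.add(new Centroid(p, 1));
            } else {
                nearest.merge(p, 1);
            }
        }
        return sketch;
    }

    public static void main(String[] args) {
        int numMappers = 3, pointsPerMapper = 10_000, k = 10;
        long withoutCombiner = 0, withCombiner = 0;
        for (int m = 0; m < numMappers; m++) {
            double[] pts = new double[pointsPerMapper];
            for (int i = 0; i < pts.length; i++) {
                pts[i] = (i * 7 + m) % 100;
            }
            withoutCombiner += pts.length;          // reducer buffers every raw point
            withCombiner += combine(pts, k).size(); // reducer sees at most k per mapper
        }
        System.out.println("reducer input without combiner: " + withoutCombiner);
        System.out.println("reducer input with combiner:    " + withCombiner);
    }
}
```

This is only the aggregation idea in miniature; the real fix would be wiring a combiner (or equivalent map-side aggregation) back into the Streaming KMeans job so the reduce-side memory stays bounded regardless of input size.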