There is no combiner in the present implementation. Moreover, the codepath 
that's executed when the 'reduceStreamingKMeans' (-rskm) flag is set does not 
have adequate test coverage and needs to be exercised more thoroughly. Most of 
the issues I had been seeing were due to specifying the -rskm flag.
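To illustrate why the missing combiner matters, here is a minimal, self-contained sketch (not Mahout code; the class and method names are hypothetical) of the idea: each mapper's many weighted centroids get merged map-side into a single partial per key, so the reducer's input stays bounded instead of growing with the number of points.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: combiner-style pre-aggregation for a streaming k-means job.
// Without a combiner, the reducer must buffer every record the mappers
// emit; with one, it only merges a handful of pre-aggregated partials.
public class CombinerSketch {
    // A weighted 1-D centroid: a mean and the number of points it summarizes.
    static final class Centroid {
        double mean;
        long weight;
        Centroid(double mean, long weight) { this.mean = mean; this.weight = weight; }
        // Merge another centroid into this one via a weighted average.
        void merge(Centroid other) {
            long w = weight + other.weight;
            mean = (mean * weight + other.mean * other.weight) / w;
            weight = w;
        }
    }

    // Combiner step: collapse one mapper's emissions into a single partial,
    // so the reducer sees O(#mappers) records rather than O(#points).
    static Centroid combine(List<Centroid> mapperOutput) {
        Centroid acc = new Centroid(0, 0);
        for (Centroid c : mapperOutput) {
            if (acc.weight == 0) { acc.mean = c.mean; acc.weight = c.weight; }
            else acc.merge(c);
        }
        return acc;
    }

    public static void main(String[] args) {
        // Two "mappers", each emitting weighted centroids.
        List<Centroid> m1 = Arrays.asList(new Centroid(1, 1), new Centroid(3, 1));
        List<Centroid> m2 = Arrays.asList(new Centroid(5, 2));
        // The reducer merges one partial per mapper instead of everything.
        Centroid r = combine(m1);
        r.merge(combine(m2));
        System.out.println(r.mean + " " + r.weight);  // prints "3.5 4"
    }
}
```

In an actual Hadoop job the same merge logic would be registered as the combiner class so it runs on the map side before the shuffle; this sketch only shows the aggregation itself.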

Amir provided a dataset with about 300K points; could someone try running 
Streaming KMeans on it, both the MapReduce and sequential versions? I have had 
no luck with either. Here is the link to the dataset: 
http://gluegadget.com/split-vectors.tar.bz2

On Thursday, December 26, 2013 3:02 PM, Ted Dunning <ted.dunn...@gmail.com> 
wrote:
On Thu, Dec 26, 2013 at 10:19 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

I've heard from people outside of dev@ and user@ who tried running Streaming 
KMeans (from 0.8) on large datasets on their production clusters and saw the 
job crash in the reduce phase due to OOM errors (this was with -Xmx2GB). 
Excessive memory usage in the reduce phase was a known bug that was addressed 
(supposedly) by using a combiner.

This really smells like a bug that was somehow resurrected. Clearly that also 
means that our unit tests are insufficient.
