I would push the code freeze back until this is resolved (it is also the reason I had been holding off). This should have been raised for the 0.8 release, and I don't think we should defer it to the next one.
I have heard from people outside of dev@ and user@ who tried running Streaming KMeans (from 0.8) on their production clusters on large datasets and saw the job crash in the Reduce phase due to OOM errors (this is with -Xmx2GB).

On Thursday, December 26, 2013 12:53 PM, Isabel Drost-Fromm <isa...@apache.org> wrote:

On Thu, Dec 26, 2013 at 12:28:18AM -0800, Suneel Marthi wrote:
> It's when you increase the no. of documents and the size of each
> document (add more dimensions) that you start seeing performance issues
> which are:
> a) The Mappers take long to complete and it's either the searcher.remove() or
> searcher.searchFirst() calls (will check again in my next attempt) that seems
> to be the bottleneck.
> b) Once the Mappers complete (after several hours) the Reducer dies with an
> OOM exception (despite having set -Xmx2G).

Given that there seem to be a couple of people experiencing issues, I think it
makes sense to create a JIRA issue here to track progress - either code
improvements or better documentation on how to run this implementation.

@Suneel: Does it make sense to push code freeze to after fixing this, or should
this be communicated as a known defect in the release notes?

Isabel