Re: Streaming KMeans clustering

Suneel Marthi Thu, 26 Dec 2013 10:20:42 -0800

I would push the code freeze until this is resolved (this is the reason I had
been holding off). This is something that should have been raised for the 0.8
release, and I don't think we should defer it to the next one.

I have heard from people outside of dev@ and user@ who tried running Streaming
KMeans (from 0.8) on their production clusters on large datasets and saw
the job crash in the Reduce phase due to OOM errors (this is with -Xmx2G).

On Thursday, December 26, 2013 12:53 PM, Isabel Drost-Fromm <isa...@apache.org> 
wrote:

On Thu, Dec 26, 2013 at 12:28:18AM -0800, Suneel Marthi wrote:

> It's when you increase the no. of documents and the size of each
> document (add more dimensions) that you start seeing performance issues,
> which are:
> a) The Mappers take long to complete, and it's either the searcher.remove() or
> searcher.searchFirst() calls (will check again in my next attempt) that seem
> to be the bottleneck.
> b) Once the Mappers complete (after several hours), the Reducer dies with an
> OOM exception (despite having set -Xmx2G).

Given that there seem to be a couple of people experiencing issues, I think it
makes sense to create a JIRA issue here to track progress - either code
improvements or better documentation on how to run this implementation.

@Suneel: Does it make sense to push code freeze to after fixing this or should 
this be communicated as a known defect in the release notes?

Isabel
