Re: Streaming KMeans clustering

Dan Filimon Fri, 27 Dec 2013 04:15:15 -0800

Hi everyone!

So for the two issues:

1. Mapper slowness: this is basically an issue with the searcher being
used. The default is ProjectionSearch which was doing a good job. If the
bottleneck is indeed remove or searchFirst, that sort of points out a
limitation in the basic algorithm (unless it turns out there's something
super dumb going on).

2. Reducer OOM: for this job, if we have m mappers, clustering n points
into k clusters, each mapper should get roughly n / m points to cluster,
and produce k log (n / m) centroids. The total number of points that the
reducer gets is m * k * log (n / m).

As you can see, this really depends on the particular data set we're
working with. Suppose k is n / 10 and you have m = 10 mappers. That gets
you 10 * (n / 10) * log (n / 10) = n log (n / 10) ~ n log n points that the
reducer has to cluster, which makes this approach totally useless because
you'll have more points at the end than at the beginning.
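
To make the arithmetic concrete, here is a minimal Python sketch of the
estimate (this is not Mahout code). The function name is hypothetical, and
the natural logarithm is an assumption, since the base of the log is not
stated above.

```python
import math


def reducer_input_size(n: int, m: int, k: int) -> float:
    """Estimate how many centroids reach the reducer: m * k * log(n / m).

    n: total number of points, m: number of mappers, k: target clusters.
    """
    if m <= 0 or k <= 0 or n <= m:
        raise ValueError("expected m > 0, k > 0 and n > m")
    # Assumption: natural logarithm; the base is not specified in the thread.
    return m * k * math.log(n / m)


if __name__ == "__main__":
    n, m = 1_000_000, 10
    # Compare a small k with the k = n / 10 case discussed above.
    for k in (100, n // 10):
        estimate = reducer_input_size(n, m, k)
        print(f"k={k}: ~{estimate:,.0f} reducer points for n={n:,}")
```

Under these assumptions, with n = 1,000,000 and m = 10, k = 100 gives on the
order of 11.5 thousand reducer points, while k = n / 10 gives on the order of
11.5 million, which is more than the original n.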

In any case, if the number of reducer centroids (the m * k * log (n / m))
is acceptable, there's an option to run another StreamingKMeans in the
reducer: there's the reduceStreamingKMeans flag in the driver.

However, I feel that if you find yourself needing this flag, it probably
shows that this MapReduce approach is not what you want and you should just
run StreamingKMeans directly.

In retrospect, I think there should be code in the driver that checks for
this and spits out a warning. :)
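
A rough sketch of what such a check could look like, written in Python for
brevity rather than as actual driver code. The function name, the
max_fraction threshold, and the use of the natural logarithm are all
assumptions made for illustration; nothing like this exists in the driver
as far as this thread says.

```python
import math
from typing import Optional


def reducer_load_warning(n: int, m: int, k: int,
                         max_fraction: float = 1.0) -> Optional[str]:
    """Return a warning if the reducer would see too many centroids.

    Hypothetical helper: compares m * k * log(n / m) against a fraction
    of the n input points and returns None when the load looks acceptable.
    """
    if m <= 0 or k <= 0 or n <= m:
        raise ValueError("expected m > 0, k > 0 and n > m")
    # Assumption: natural logarithm; the base is not specified in the thread.
    estimate = m * k * math.log(n / m)
    if estimate >= max_fraction * n:
        return (
            f"Reducer would receive ~{estimate:,.0f} centroids for "
            f"{n:,} input points; consider running StreamingKMeans directly."
        )
    return None


if __name__ == "__main__":
    message = reducer_load_warning(n=1_000_000, m=10, k=100_000)
    if message is not None:
        print("WARNING:", message)
```

The threshold is the design choice here: max_fraction = 1.0 only flags the
case where the reducer sees at least as many points as the input, and a
smaller value would warn earlier.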

Thoughts?

(Happy Holidays to everyone too! :D)

On Fri, Dec 27, 2013 at 9:59 AM, Sotiris Salloumis <i...@eprice.gr> wrote:

> Hi Suneel,
>
> Is it possible to upload debug or log messages from the OOM exceptions you
> have seen so we can take a look at them?
>
> Regards
> Sotiris
>
>
> On Thu, Dec 26, 2013 at 8:19 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>
> > I would push the code freeze until this is resolved (and the reason I had
> > been holding off). This is something that should have been raised for the
> > 0.8 release and I don't think we should defer this to the next one.
> >
> > I heard from people outside of dev@ and user@ who have tried running
> > Streaming KMeans (from 0.8) on their production clusters on large
> > datasets and have seen the job crash in the Reduce phase due to OOM
> > errors (this is with -Xmx2GB).
> >
> > On Thursday, December 26, 2013 12:53 PM, Isabel Drost-Fromm <
> > isa...@apache.org> wrote:
> >
> > On Thu, Dec 26, 2013 at 12:28:18AM -0800, Suneel Marthi wrote:
> >
> > > It's when you increase the no. of documents and the size of each
> > > document (add more dimensions) that you start seeing performance
> > > issues, which are:
> > > a) The Mappers take long to complete, and it's either the
> > > searcher.remove() or searcher.searchFirst() calls (will check again
> > > in my next attempt) that seem to be the bottleneck.
> > > b) Once the Mappers complete (after several hours), the Reducer dies
> > > with an OOM exception (despite having set -Xmx2G).
> >
> > Given that there seem to be a couple of people experiencing issues, I
> > think it makes sense to create a JIRA issue here to track progress -
> > either code improvements or better documentation on how to run this
> > implementation.
> >
> > @Suneel: Does it make sense to push code freeze to after fixing this or
> > should this be communicated as a known defect in the release notes?
> >
> >
> > Isabel
>
