For reference, on a 16-core machine, I was able to run the sequential
version of streaming k-means on 1,000,000 points, each with 10 dimensions,
in about 20 seconds.  The map-reduce versions are comparable, subject to
scaling, except for startup time.


On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <s...@apache.org> wrote:

> That the algorithm runs a single reducer is expected. The algorithm
> creates a sketch of the data in parallel in the map-phase, which is
> collected by the reducer afterwards. The reducer then applies an
> expensive in-memory clustering algorithm to the sketch.
>
> Which dataset are you using for testing? I can also do some tests on a
> cluster here.
>
> I can imagine two possible causes for the problems: Maybe there's a
> problem with the vectors and some calculations take very long because
> the wrong access pattern or implementation is chosen.
>
> Another problem could be that the mappers and reducers have too little
> memory and spend a lot of time running garbage collections.
>
> --sebastian
>
>
> On 23.12.2013 22:14, Suneel Marthi wrote:
> > Has anyone been successful running Streaming KMeans clustering on a large
> dataset (> 100,000 points)?
> >
> >
> > It just seems to take a very long time (> 4hrs) for the mappers to
> finish on about 300K data points and the reduce phase has only a single
> reducer running and throws an OOM failing the job several hours after the
> job has been kicked off.
> >
> > It's the same story when trying to run in sequential mode.
> >
> > Looking at the code, the bottleneck seems to be in
> StreamingKMeans.clusterInternal(). Without understanding the behaviour of
> the algorithm, I am not sure if the sequence of steps in there is correct.
> >
> >
> > There are a few calls that call themselves repeatedly over and over again
> like StreamingKMeans.clusterInternal() and Searcher.searchFirst().
> >
> > We really need to have this working on datasets that are larger than the
> 20K Reuters dataset.
> >
> > I am trying to run this on 300K vectors with k = 100, km = 1261 and
> FastProjectionSearch.
> >
>
>

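To make the structure discussed in the quoted messages concrete, here is a
minimal sketch in plain Python of the two phases: each mapper builds a small
weighted sketch of its partition in one pass, and a single reducer merges the
sketches and clusters them in memory. This is not Mahout's code. All function
names are hypothetical, the starting cutoff and the 1.5 growth factor are
arbitrary, the reducer uses a few weighted Lloyd iterations as a stand-in for
the expensive in-memory step, and the k * ln(n) sketch size is an assumed rule
of thumb (it does reproduce km = 1261 for k = 100 and 300K points).

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point (brute-force search)."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def estimate_map_clusters(k, n):
    """Assumed rule of thumb: sketch size of roughly k * ln(n)."""
    return int(k * math.log(n))


def absorb(centroids, point, weight, cutoff, rng):
    """Merge a weighted point into its nearest centroid, or open a new one."""
    if centroids:
        i = nearest(point, [c for c, _ in centroids])
        c, cw = centroids[i]
        # A far-away (or heavy) point is more likely to start a new centroid.
        if rng.random() >= weight * sq_dist(point, c) / cutoff:
            total = cw + weight
            centroids[i] = (tuple((a * cw + b * weight) / total
                                  for a, b in zip(c, point)), total)
            return
    centroids.append((tuple(point), weight))


def streaming_sketch(points, max_centroids, rng, cutoff=1e-3):
    """Map side: one pass over the points, keeping at most max_centroids."""
    centroids = []
    for p in points:
        absorb(centroids, p, 1.0, cutoff, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: loosen the cutoff and re-cluster the sketch.
            cutoff *= 1.5
            old, centroids = centroids, []
            for c, w in old:
                absorb(centroids, c, w, cutoff, rng)
    return centroids


def reduce_phase(sketches, k, iterations=10):
    """Reduce side: merge all sketches, then cluster them in memory."""
    points = [cw for sketch in sketches for cw in sketch]
    dim = len(points[0][0])
    # Seed with the k heaviest sketch centroids (a simplification).
    centers = [c for c, _ in sorted(points, key=lambda t: -t[1])[:k]]
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in centers]
        weights = [0.0] * len(centers)
        for p, w in points:
            j = nearest(p, centers)
            weights[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        centers = [tuple(s / weights[j] for s in sums[j]) if weights[j] > 0
                   else centers[j] for j in range(len(centers))]
    return centers


def run_demo(n=2000, dim=10, k=5, partitions=4, seed=42):
    """Run the mappers over synthetic data, then the single reducer."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    km = estimate_map_clusters(k, n)
    chunks = [data[i::partitions] for i in range(partitions)]
    sketches = [streaming_sketch(chunk, km, rng) for chunk in chunks]
    return sketches, reduce_phase(sketches, k)


if __name__ == "__main__":
    sketches, centers = run_demo()
    print(len(centers), "centers from",
          sum(len(s) for s in sketches), "sketch centroids")
```

The reducer only ever sees the sketch centroids, not the raw points, so a
single reducer is workable as long as the sketches stay small. The
re-clustering loop in streaming_sketch is the toy counterpart of the repeated
collapse step, and the brute-force nearest() is where a real implementation
would plug in an approximate searcher.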