Hi, I also had problems getting up to speed, but I blamed the cardinality of the vectors for that. I didn't do the math exactly, but while streaming k-means improves over regular k-means by needing only log(k) and (number of data points / k) passes, the dimension parameter d from the original k*d*n cost stays untouched, right?
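To make that concrete, here is a back-of-the-envelope sketch with made-up numbers matching Ted's test below (1M points, 10 dimensions). It assumes the usual k*d*n cost model for a single Lloyd's k-means pass and roughly n*d*log(k) work for the streaming sketch phase; the constants are guesses, the point is only that d multiplies both terms:

```python
import math

# Hypothetical workload: 1M points, 10 dimensions, 100 clusters.
n = 1_000_000   # number of data points
d = 10          # vector dimension (cardinality)
k = 100         # number of clusters

# One pass of Lloyd's k-means: every point against every centroid.
lloyd_per_pass = k * d * n

# Streaming sketch phase, assuming a well-behaved searcher: ~ n * d * log2(k).
streaming_sketch = n * d * math.log2(k)

print(f"Lloyd's k-means, one pass:  {lloyd_per_pass:,.0f} distance ops")
print(f"streaming sketch (approx.): {streaming_sketch:,.0f} distance ops")
print(f"speedup factor ~ k / log2(k) = {k / math.log2(k):.1f}x")
```

Note that d appears as a plain factor in both estimates, so high-dimensional vectors slow both variants down proportionally, which is what the question above is getting at.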
What is your vector's cardinality?

On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

> Ted,
>
> What were the CLI parameters when you ran this test for 1M points - no. of
> clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?
>
> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> For reference, on a 16-core machine, I was able to run the sequential
> version of streaming k-means on 1,000,000 points, each with 10 dimensions,
> in about 20 seconds. The map-reduce versions are comparable subject to
> scaling, except for startup time.
>
> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <s...@apache.org> wrote:
>
> > That the algorithm runs a single reducer is expected. The algorithm
> > creates a sketch of the data in parallel in the map phase, which is
> > collected by the reducer afterwards. The reducer then applies an
> > expensive in-memory clustering algorithm to the sketch.
> >
> > Which dataset are you using for testing? I can also do some tests on a
> > cluster here.
> >
> > I can imagine two possible causes for the problems: maybe there's a
> > problem with the vectors, and some calculations take very long because
> > the wrong access pattern or implementation is chosen.
> >
> > Another problem could be that the mappers and reducers have too little
> > memory and spend a lot of time running garbage collections.
> >
> > --sebastian
> >
> > On 23.12.2013 22:14, Suneel Marthi wrote:
> > > Has anyone been successful running Streaming KMeans clustering on a
> > > large dataset (> 100,000 points)?
> > >
> > > It just seems to take a very long time (> 4 hrs) for the mappers to
> > > finish on about 300K data points, and the reduce phase has only a
> > > single reducer running, which throws an OOM and fails the job several
> > > hours after it has been kicked off.
> > >
> > > It's the same story when trying to run in sequential mode.
> > >
> > > Looking at the code, the bottleneck seems to be in
> > > StreamingKMeans.clusterInternal(); without understanding the behaviour
> > > of the algorithm, I am not sure if the sequence of steps in there is
> > > correct.
> > >
> > > There are a few calls that invoke themselves repeatedly over and over,
> > > like StreamingKMeans.clusterInternal() and Searcher.searchFirst().
> > >
> > > We really need to have this working on datasets larger than the 20K
> > > Reuters dataset.
> > >
> > > I am trying to run this on 300K vectors with k = 100, km = 1261 and
> > > FastProjectionSearch.
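On the km = 1261 quoted above: I can't say for certain how it was chosen, but if I recall the Mahout guidance correctly, the suggested sketch size is about k * ln(n), and that reproduces the number exactly for k = 100 and n = 300,000 (my assumption, worth double-checking against the docs):

```python
import math

k = 100       # requested number of clusters
n = 300_000   # number of input vectors

# Suggested sketch size: km ~= k * ln(n), truncated to an integer.
km = int(k * math.log(n))
print(km)  # 1261
```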