Re: Streaming KMeans clustering

Ted Dunning Wed, 25 Dec 2013 23:23:58 -0800

Interesting.  In Dan's tests on sparse data, he got about 10x speedup net.

You didn't run multiple sketching passes did you?



Also, which version?  There was a horrendous clone in there at one time.




On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte <
[email protected]> wrote:

> everybody should have the right to do
>
> job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");
>
> for that :)
>
>
> For my problems, i always felt the sketching took too long. i put up a
> simple comparison here:
>
> [email protected]:baunz/cluster-comprarison.git
>
> it generates some sample vectors and clusters them with regular k-means,
> and streaming k-means, both sequentially. i took 10 kmeans iterations as a
> benchmark and used the default values for FastProjectionSearch from the
> kMeans Driver Class.
>
> Visual VM tells me the most time is spent in FastProjectionSearch.remove().
> This is called on every added datapoint.
>
> Maybe i got something wrong but for this sparse, high dimensional vectors i
> never got streaming k-means faster than the regula version
>
>
>
>
> On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi <[email protected]
> >wrote:
>
> > Not sure how that would work in a corporate setting wherein there's a
> > fixed systemwide setting that cannot be overridden.
> >
> > Sent from my iPhone
> >
> > > On Dec 25, 2013, at 9:44 AM, Sebastian   Schelter <[email protected]>
> > wrote:
> > >
> > >> On 25.12.2013 14:19, Suneel Marthi wrote:
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <
> > [email protected]> wrote:
> > >>
> > >>>> For reference, on a 16 core machine, I was able to run the
> sequential
> > >>>> version of streaming k-means on 1,000,000 points, each with 10
> > dimensions
> > >>>> in about 20 seconds.  The map-reduce versions are comparable subject
> > to
> > >>>> scaling except for startup time.
> > >>
> > >> @Ted, were u working off the Streaming KMeans impl as in Mahout 0.8.
> > Not sure how this would have even worked for u in sequential mode in
> light
> > of the issues reported against M-1314, M-1358, M-1380 (all of which
> impact
> > the sequential mode); unless u had fixed them locally.
> > >> What were ur estimatedDistanceCutoff, number of clusters 'k',
> > projection search and how much memory did u have to allocate to the
> single
> > Reducer?
> > >
> > > If I read the source code correctly, the final reducer clusters the
> > > sketch which should contain m * k * log n intermediate centroids, where
> > > k is the number of desired clusters, m is the number of mappers run and
> > > n is the number of datapoints. Those centroids are expected to be
> dense,
> > > so we can estimate the memory required for the final reducer using this
> > > formula.
> > >
> > >>
> > >>
> > >>
> > >>
> > >>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <[email protected]>
> > wrote:
> > >>>
> > >>> That the algorithm runs a single reducer is expected. The algorithm
> > >>> creates a sketch of
> > >> the data in parallel in the map-phase, which is
> > >>> collected by the reducer afterwards. The reducer then applies an
> > >>> expensive in-memory clustering algorithm to the sketch.
> > >>>
> > >>> Which dataset are you using for testing? I can also do some tests on
> a
> > >>> cluster here.
> > >>>
> > >>> I can imagine two possible causes for the problems: Maybe there's a
> > >>> problem with the vectors and some calculations take very long because
> > >>> the wrong access pattern or implementation is chosen.
> > >>>
> > >>> Another problem could be that the mappers and reducers have too few
> > >>> memory and spend a lot of time running garbage collections.
> > >>>
> > >>> --sebastian
> > >>>
> > >>>
> > >>> On 23.12.2013 22:14,
> > >> Suneel Marthi wrote:
> > >>>> Has anyone be successful running Streaming KMeans clustering on a
> > large
> > >>> dataset (> 100,000 points)?
> > >>>>
> > >>>>
> > >>>> It just seems to take a very long time (> 4hrs) for the mappers to
> > >>> finish on about 300K data points and the reduce phase has only a
> single
> > >>> reducer running and throws an OOM failing the job several hours after
> > the
> > >>> job has been kicked off.
> > >>>>
> > >>>> Its the same story when trying to run in sequential mode.
> > >>>>
> > >>>> Looking at the code the bottleneck seems to be in
> > >>> StreamingKMeans.clusterInternal(), without understanding the
> behaviour
> > of
> > >>> the algorithm I am not sure if the sequence of steps in there is
> > correct.
> > >>>>
> > >>>>
> > >>>> There are few calls that call themselves repeatedly over and over
> > again
> > >>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> > >>>>
> > >>>> We really need to have this working on datasets that are larger than
> > 20K
> > >>> reuters datasets.
> > >>>>
> > >>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
> > >>> FastProjectSearch.
> > >
> >
>

Re: Streaming KMeans clustering

Reply via email to