Everybody should have the right to do

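// per-job override: give the reduce tasks a 2 GB max heap (Hadoop 1.x property name)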
job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");

for that :)


For my problems, I always felt the sketching took too long. I put up a
simple comparison here:

g...@github.com:baunz/cluster-comprarison.git

It generates some sample vectors and clusters them with regular k-means
and with streaming k-means, both sequentially. I took 10 k-means
iterations as a benchmark and used the default values for
FastProjectionSearch from the k-means driver class.
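
In case you don't want to clone the repo, the streaming side of the
benchmark boils down to roughly this (a minimal sketch against the 0.8
API, written from memory; the class name, the searcher parameters (20
projections, search size 10) and the k * ln(n) sketch size are my
assumptions here, the repo has the actual code):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.mahout.clustering.streaming.cluster.StreamingKMeans;
import org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.neighborhood.FastProjectionSearch;
import org.apache.mahout.math.neighborhood.UpdatableSearcher;

public class StreamingKMeansBenchmark {

  public static void main(String[] args) {
    int n = 100000;  // number of points
    int dim = 1000;  // dimensionality
    int k = 100;     // desired number of clusters
    Random rng = new Random(42);

    // generate random sample points (the repo uses sparse vectors instead)
    List<Centroid> points = new ArrayList<Centroid>(n);
    for (int i = 0; i < n; i++) {
      DenseVector v = new DenseVector(dim);
      for (int j = 0; j < dim; j++) {
        v.setQuick(j, rng.nextGaussian());
      }
      points.add(new Centroid(i, v, 1.0));
    }

    // the searcher that StreamingKMeans queries and updates for every point
    UpdatableSearcher searcher =
        new FastProjectionSearch(new SquaredEuclideanDistanceMeasure(), 20, 10);

    long start = System.currentTimeMillis();
    // the sketch is allowed to grow to roughly k * log(n) centroids
    StreamingKMeans skm = new StreamingKMeans(searcher, (int) (k * Math.log(n)));
    UpdatableSearcher sketch = skm.cluster(points);
    long elapsed = System.currentTimeMillis() - start;

    System.out.println("sketch size: " + sketch.size() + ", took " + elapsed + " ms");
  }
}

The regular k-means side of the comparison is just KMeansDriver run
sequentially for 10 iterations over the same input.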

VisualVM tells me that most of the time is spent in
FastProjectionSearch.remove(), which is called for every added data point
(presumably because the searcher removes and re-inserts a centroid each
time it is updated).

Maybe I got something wrong, but for these sparse, high-dimensional
vectors I never got streaming k-means to run faster than the regular
version.




On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

> Not sure how that would work in a corporate setting wherein there's a
> fixed systemwide setting that cannot be overridden.
>
> Sent from my iPhone
>
> > > On Dec 25, 2013, at 9:44 AM, Sebastian Schelter <s...@apache.org> wrote:
> >
> >> On 25.12.2013 14:19, Suneel Marthi wrote:
> >>
> >>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >>>> For reference, on a 16 core machine, I was able to run the sequential
> >>>> version of streaming k-means on 1,000,000 points, each with 10
> >>>> dimensions, in about 20 seconds. The map-reduce versions are
> >>>> comparable subject to scaling, except for startup time.
> >>
> >> @Ted, were you working off the Streaming KMeans impl as in Mahout 0.8?
> >> I'm not sure how this would even have worked for you in sequential mode
> >> in light of the issues reported against M-1314, M-1358 and M-1380 (all
> >> of which impact the sequential mode), unless you had fixed them locally.
> >> What were your estimatedDistanceCutoff, number of clusters 'k' and
> >> projection search, and how much memory did you have to allocate to the
> >> single reducer?
> >
> > If I read the source code correctly, the final reducer clusters the
> > sketch, which should contain m * k * log n intermediate centroids, where
> > k is the number of desired clusters, m is the number of mappers run and
> > n is the number of data points. Those centroids are expected to be
> > dense, so we can estimate the memory required by the final reducer using
> > this formula.
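
To put numbers on that formula: with, say, m = 10 mappers (my guess), k =
100 and n = 300,000 as in Suneel's job, the sketch would hold about
10 * 100 * ln(300,000) ≈ 12,600 dense centroids. At an assumed 50,000
dimensions that is 12,600 * 50,000 * 8 bytes ≈ 5 GB of heap for the single
reducer, which would explain the OOM.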
> >
> >>
> >>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <s...@apache.org> wrote:
> >>>
> >>> That the algorithm runs a single reducer is expected. The algorithm
> >>> creates a sketch of the data in parallel in the map phase, which is
> >>> collected by the reducer afterwards. The reducer then applies an
> >>> expensive in-memory clustering algorithm to the sketch.
> >>>
> >>> Which dataset are you using for testing? I can also do some tests on a
> >>> cluster here.
> >>>
> >>> I can imagine two possible causes for the problems: Maybe there's a
> >>> problem with the vectors and some calculations take very long because
> >>> the wrong access pattern or implementation is chosen.
> >>>
> >>> Another problem could be that the mappers and reducers have too little
> >>> memory and spend a lot of time running garbage collections.
> >>>
> >>> --sebastian
> >>>
> >>>
> >>> On 23.12.2013 22:14, Suneel Marthi wrote:
> >>>> Has anyone been successful running Streaming KMeans clustering on a
> >>>> large dataset (> 100,000 points)?
> >>>>
> >>>>
> >>>> It just seems to take a very long time (> 4 hrs) for the mappers to
> >>>> finish on about 300K data points. The reduce phase has only a single
> >>>> reducer running, which throws an OOM and fails the job several hours
> >>>> after it has been kicked off.
> >>>>
> >>>> It's the same story when trying to run in sequential mode.
> >>>>
> >>>> Looking at the code, the bottleneck seems to be in
> >>>> StreamingKMeans.clusterInternal(); without understanding the behaviour
> >>>> of the algorithm, I am not sure if the sequence of steps in there is
> >>>> correct.
> >>>>
> >>>>
> >>>> There are a few methods that are called over and over again, like
> >>>> StreamingKMeans.clusterInternal() and Searcher.searchFirst().
> >>>>
> >>>> We really need to have this working on datasets that are larger than
> >>>> the 20K Reuters dataset.
> >>>>
> >>>> I am trying to run this on 300K vectors with k = 100, km = 1261 and
> >>>> FastProjectionSearch.
> >
>
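
Side note on the numbers above: km = 1261 is consistent with k * ln(n) =
100 * ln(300,000) ≈ 1261, i.e. the per-mapper sketch size that the
m * k * log n estimate assumes.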
