Re: Streaming KMeans clustering

Johannes Schulte Wed, 25 Dec 2013 23:58:46 -0800

To be honest, i always cancelled the sketching after a while because i
wasn't satisfied with the points per second speed. The version used is the
0.8 release.


if i find the time i'm gonna look what is called when and where and how
often and what the problem could be.


On Thu, Dec 26, 2013 at 8:22 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Interesting.  In Dan's tests on sparse data, he got about 10x speedup net.
>
> You didn't run multiple sketching passes did you?
>
>
> Also, which version?  There was a horrendous clone in there at one time.
>
>
>
>
> On Wed, Dec 25, 2013 at 2:07 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
>
> > everybody should have the right to do
> >
> > job.getConfiguration().set("mapred.reduce.child.java.opts", "-Xmx2G");
> >
> > for that :)
> >
> >
> > For my problems, i always felt the sketching took too long. i put up a
> > simple comparison here:
> >
> > g...@github.com:baunz/cluster-comprarison.git
> >
> > it generates some sample vectors and clusters them with regular k-means,
> > and streaming k-means, both sequentially. i took 10 kmeans iterations as
> a
> > benchmark and used the default values for FastProjectionSearch from the
> > kMeans Driver Class.
> >
> > Visual VM tells me the most time is spent in
> FastProjectionSearch.remove().
> > This is called on every added datapoint.
> >
> > Maybe i got something wrong but for this sparse, high dimensional
> vectors i
> > never got streaming k-means faster than the regula version
> >
> >
> >
> >
> > On Wed, Dec 25, 2013 at 3:49 PM, Suneel Marthi <suneel_mar...@yahoo.com
> > >wrote:
> >
> > > Not sure how that would work in a corporate setting wherein there's a
> > > fixed systemwide setting that cannot be overridden.
> > >
> > > Sent from my iPhone
> > >
> > > > On Dec 25, 2013, at 9:44 AM, Sebastian   Schelter <s...@apache.org>
> > > wrote:
> > > >
> > > >> On 25.12.2013 14:19, Suneel Marthi wrote:
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <
> > > ted.dunn...@gmail.com> wrote:
> > > >>
> > > >>>> For reference, on a 16 core machine, I was able to run the
> > sequential
> > > >>>> version of streaming k-means on 1,000,000 points, each with 10
> > > dimensions
> > > >>>> in about 20 seconds.  The map-reduce versions are comparable
> subject
> > > to
> > > >>>> scaling except for startup time.
> > > >>
> > > >> @Ted, were u working off the Streaming KMeans impl as in Mahout 0.8.
> > > Not sure how this would have even worked for u in sequential mode in
> > light
> > > of the issues reported against M-1314, M-1358, M-1380 (all of which
> > impact
> > > the sequential mode); unless u had fixed them locally.
> > > >> What were ur estimatedDistanceCutoff, number of clusters 'k',
> > > projection search and how much memory did u have to allocate to the
> > single
> > > Reducer?
> > > >
> > > > If I read the source code correctly, the final reducer clusters the
> > > > sketch which should contain m * k * log n intermediate centroids,
> where
> > > > k is the number of desired clusters, m is the number of mappers run
> and
> > > > n is the number of datapoints. Those centroids are expected to be
> > dense,
> > > > so we can estimate the memory required for the final reducer using
> this
> > > > formula.
> > > >
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <
> s...@apache.org>
> > > wrote:
> > > >>>
> > > >>> That the algorithm runs a single reducer is expected. The algorithm
> > > >>> creates a sketch of
> > > >> the data in parallel in the map-phase, which is
> > > >>> collected by the reducer afterwards. The reducer then applies an
> > > >>> expensive in-memory clustering algorithm to the sketch.
> > > >>>
> > > >>> Which dataset are you using for testing? I can also do some tests
> on
> > a
> > > >>> cluster here.
> > > >>>
> > > >>> I can imagine two possible causes for the problems: Maybe there's a
> > > >>> problem with the vectors and some calculations take very long
> because
> > > >>> the wrong access pattern or implementation is chosen.
> > > >>>
> > > >>> Another problem could be that the mappers and reducers have too few
> > > >>> memory and spend a lot of time running garbage collections.
> > > >>>
> > > >>> --sebastian
> > > >>>
> > > >>>
> > > >>> On 23.12.2013 22:14,
> > > >> Suneel Marthi wrote:
> > > >>>> Has anyone be successful running Streaming KMeans clustering on a
> > > large
> > > >>> dataset (> 100,000 points)?
> > > >>>>
> > > >>>>
> > > >>>> It just seems to take a very long time (> 4hrs) for the mappers to
> > > >>> finish on about 300K data points and the reduce phase has only a
> > single
> > > >>> reducer running and throws an OOM failing the job several hours
> after
> > > the
> > > >>> job has been kicked off.
> > > >>>>
> > > >>>> Its the same story when trying to run in sequential mode.
> > > >>>>
> > > >>>> Looking at the code the bottleneck seems to be in
> > > >>> StreamingKMeans.clusterInternal(), without understanding the
> > behaviour
> > > of
> > > >>> the algorithm I am not sure if the sequence of steps in there is
> > > correct.
> > > >>>>
> > > >>>>
> > > >>>> There are few calls that call themselves repeatedly over and over
> > > again
> > > >>> like SteamingKMeans.clusterInternal() and Searcher.searchFirst().
> > > >>>>
> > > >>>> We really need to have this working on datasets that are larger
> than
> > > 20K
> > > >>> reuters datasets.
> > > >>>>
> > > >>>> I am trying to run this on 300K vectors with k= 100, km = 1261 and
> > > >>> FastProjectSearch.
> > > >
> > >
> >
>

Re: Streaming KMeans clustering

Reply via email to