Hey Sebastian, it was a text-like clustering problem with a dimensionality of 100,000; the number of data points could have been a million, but I always cancelled the run after a while (I used the Java classes, not the command-line version, and monitored the progress).
As for my statements above: they are possibly not quite correct. Sure, the projection search reduces the amount of searching needed, but when I looked into the code I identified two problems, if I remember correctly:

- the searching of pending additions
- the projection itself

But I'll have to retry that and look into the code again. I ended up using the old k-means code on a sample of the data.

cheers,

johannes

On Wed, Dec 25, 2013 at 11:17 AM, Sebastian Schelter <s...@apache.org> wrote:

> Hi Johannes,
>
> can you share some details about the dataset that you ran streaming
> k-means on (number of datapoints, cardinality, etc.)?
>
> @Ted/Suneel Shouldn't the approximate searching techniques (e.g.
> projection search) help cope with high-dimensional inputs?
>
> --sebastian
>
>
> On 25.12.2013 10:42, Johannes Schulte wrote:
> > Hi,
> >
> > I also had problems getting up to speed, but I blamed the cardinality
> > of the vectors for that. I didn't do the math exactly, but while
> > streaming k-means improves over regular k-means by using log(k) and
> > (number of datapoints / k) passes, the dimension parameter d from the
> > original k*d*n stays untouched, right?
> >
> > What is your vectors' cardinality?
> >
> >
> > On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> >
> >> Ted,
> >>
> >> What were the CLI parameters when you ran this test for 1M points - no.
> >> of clusters k, km, distanceMeasure, projectionSearch,
> >> estimatedDistanceCutoff?
> >>
> >>
> >> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >> For reference, on a 16-core machine, I was able to run the sequential
> >> version of streaming k-means on 1,000,000 points, each with 10
> >> dimensions, in about 20 seconds. The map-reduce versions are
> >> comparable, subject to scaling, except for startup time.
> >>
> >>
> >> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <s...@apache.org> wrote:
> >>
> >>> That the algorithm runs a single reducer is expected. The algorithm
> >>> creates a sketch of the data in parallel in the map phase, which is
> >>> collected by the reducer afterwards. The reducer then applies an
> >>> expensive in-memory clustering algorithm to the sketch.
> >>>
> >>> Which dataset are you using for testing? I can also do some tests on
> >>> a cluster here.
> >>>
> >>> I can imagine two possible causes for the problems: Maybe there's a
> >>> problem with the vectors and some calculations take very long because
> >>> the wrong access pattern or implementation is chosen.
> >>>
> >>> Another problem could be that the mappers and reducers have too
> >>> little memory and spend a lot of time running garbage collections.
> >>>
> >>> --sebastian
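To put rough numbers on the k*d*n point Johannes makes above, here is a cost comparison using illustrative symbols of my own (i = number of Lloyd iterations, m = sketch size, s = candidates the approximate searcher actually probes per point; these are not Mahout parameter names):

$$
T_{\text{lloyd}} \approx i \cdot k \cdot n \cdot d,
\qquad
T_{\text{streaming}} \approx s \cdot n \cdot d + T_{\text{final}}(m, k, d),
\qquad s \ll k.
$$

Approximate search shrinks the s factor, and for sparse text vectors d is effectively the number of non-zeros per vector, but every distance evaluation still carries that d factor, so a cardinality of 100,000 dominates the constant either way.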
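For readers following along, below is a minimal, self-contained sketch of the one-pass sketch-building phase Sebastian describes. It illustrates the general streaming k-means technique only; the class name, the merge probability, and the cutoff-growing policy are simplifications of mine, not Mahout's actual StreamingKMeans internals. (The km = 1261 reported later in this thread matches k * log(n) for k = 100 and n = 300K, which is the role maxSketchSize plays here.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Toy illustration of the streaming k-means sketch phase: one pass over
 * the data maintains a weighted "sketch" of roughly k * log(n) centroids,
 * which a later in-memory pass (e.g. ball k-means) reduces to k clusters.
 */
public class StreamingSketch {
  private final List<double[]> centroids = new ArrayList<>();
  private final List<Double> weights = new ArrayList<>();
  private final int maxSketchSize;   // roughly k * log(n)
  private double distanceCutoff;     // distances below this tend to merge
  private final Random rnd = new Random(42);

  public StreamingSketch(int maxSketchSize, double initialCutoff) {
    this.maxSketchSize = maxSketchSize;
    this.distanceCutoff = initialCutoff;
  }

  public void add(double[] point) {
    if (centroids.isEmpty()) {
      centroids.add(point.clone());
      weights.add(1.0);
      return;
    }
    // Brute-force nearest-centroid search; Mahout swaps in an approximate
    // searcher (e.g. projection search) here. Note that each distance
    // evaluation still costs O(d).
    int nearest = 0;
    double best = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centroids.size(); i++) {
      double dist = squaredDistance(point, centroids.get(i));
      if (dist < best) { best = dist; nearest = i; }
    }
    // Merge with probability decreasing in the distance; otherwise open
    // a new sketch centroid.
    if (rnd.nextDouble() < distanceCutoff / (distanceCutoff + best)) {
      mergeInto(nearest, point);
    } else {
      centroids.add(point.clone());
      weights.add(1.0);
    }
    // If the sketch grows too large, loosen the cutoff so future points
    // merge more aggressively. (Mahout additionally re-clusters the
    // sketch at this point; omitted for brevity.)
    if (centroids.size() > maxSketchSize) {
      distanceCutoff *= 1.5;
    }
  }

  private void mergeInto(int i, double[] point) {
    double[] c = centroids.get(i);
    double w = weights.get(i);
    for (int j = 0; j < c.length; j++) {
      c[j] = (c[j] * w + point[j]) / (w + 1.0);
    }
    weights.set(i, w + 1.0);
  }

  private static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
```

In the map-reduce version, each mapper runs this loop over its partition and emits its weighted centroids; the single reducer concatenates the sketches and clusters the few thousand weighted points in memory, which is why a single reducer is expected.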
> >>>
> >>> On 23.12.2013 22:14, Suneel Marthi wrote:
> >>>> Has anyone been successful running Streaming KMeans clustering on a
> >>>> large dataset (> 100,000 points)?
> >>>>
> >>>> It just seems to take a very long time (> 4 hrs) for the mappers to
> >>>> finish on about 300K data points, and the reduce phase has only a
> >>>> single reducer running, which throws an OOM and fails the job
> >>>> several hours after it has been kicked off.
> >>>>
> >>>> It's the same story when trying to run in sequential mode.
> >>>>
> >>>> Looking at the code, the bottleneck seems to be in
> >>>> StreamingKMeans.clusterInternal(); without understanding the
> >>>> behaviour of the algorithm, I am not sure if the sequence of steps
> >>>> in there is correct.
> >>>>
> >>>> There are a few calls that are invoked repeatedly, over and over
> >>>> again, like StreamingKMeans.clusterInternal() and
> >>>> Searcher.searchFirst().
> >>>>
> >>>> We really need to have this working on datasets that are larger
> >>>> than the 20K Reuters dataset.
> >>>>
> >>>> I am trying to run this on 300K vectors with k = 100, km = 1261 and
> >>>> FastProjectSearch.
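Since FastProjectSearch and Searcher.searchFirst() come up repeatedly in this thread, here is a minimal sketch of the projection-search idea: project every point onto a random direction, keep the points ordered by that scalar, and compute full distances only for a small window of candidates around the query's projection. This is a toy illustration of the general technique, not Mahout's implementation; Mahout's searcher uses several projection vectors and different bookkeeping, and the searchFirst name below merely echoes the interface method mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/**
 * Toy projection search: order points by their dot product with one
 * random direction, then answer nearest-neighbour queries by scanning
 * only a small window around the query's position in that order.
 */
public class ProjectionSearch {
  private final double[] direction;                     // random projection vector
  private final List<double[]> points = new ArrayList<>();
  private final List<Double> keys = new ArrayList<>();  // projections, kept sorted
  private final int searchSize;                         // candidates per query

  public ProjectionSearch(int dim, int searchSize, long seed) {
    Random rnd = new Random(seed);
    this.direction = new double[dim];
    for (int i = 0; i < dim; i++) {
      direction[i] = rnd.nextGaussian();
    }
    this.searchSize = searchSize;
  }

  public void add(double[] p) {
    double key = dot(p, direction);                     // O(d) per insertion
    int pos = Collections.binarySearch(keys, key);
    if (pos < 0) pos = -pos - 1;
    keys.add(pos, key);
    points.add(pos, p);
  }

  /** Returns the best candidate found in the window (approximate). */
  public double[] searchFirst(double[] query) {
    double key = dot(query, direction);                 // O(d) again
    int pos = Collections.binarySearch(keys, key);
    if (pos < 0) pos = -pos - 1;
    int from = Math.max(0, pos - searchSize / 2);
    int to = Math.min(points.size(), from + searchSize);
    double[] best = null;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int i = from; i < to; i++) {                   // searchSize full O(d) distances
      double dist = squaredDistance(query, points.get(i));
      if (dist < bestDist) { bestDist = dist; best = points.get(i); }
    }
    return best;
  }

  private static double dot(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  private static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
```

The point of the structure is that a query touches only searchSize full distance computations instead of one per centroid; but the projection dot product and each candidate distance are still O(d) (or O(non-zeros) for sparse vectors), which is consistent with Johannes's suspicion at the top of the thread that the projection itself can become the bottleneck at a cardinality of 100,000.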