Re: Streaming KMeans clustering

Suneel Marthi Wed, 25 Dec 2013 06:51:31 -0800

Not sure how that would work in a corporate setting wherein there's a fixed 
systemwide setting that cannot be overridden.



> On Dec 25, 2013, at 9:44 AM, Sebastian Schelter <s...@apache.org> wrote:
> 
>> On 25.12.2013 14:19, Suneel Marthi wrote:
>> 
>>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> 
>>>> For reference, on a 16 core machine, I was able to run the sequential
>>>> version of streaming k-means on 1,000,000 points, each with 10 dimensions
>>>> in about 20 seconds.  The map-reduce versions are comparable subject to
>>>> scaling except for startup time.
>> 
>> @Ted, were you working off the Streaming KMeans implementation as in Mahout 
>> 0.8? I am not sure how this would even have worked for you in sequential 
>> mode in light of the issues reported in M-1314, M-1358 and M-1380 (all of 
>> which affect sequential mode), unless you had fixed them locally.
>> What were your estimatedDistanceCutoff, number of clusters 'k' and 
>> projection search, and how much memory did you have to allocate to the 
>> single reducer?
> 
> If I read the source code correctly, the final reducer clusters the
> sketch which should contain m * k * log n intermediate centroids, where
> k is the number of desired clusters, m is the number of mappers run and
> n is the number of datapoints. Those centroids are expected to be dense,
> so we can estimate the memory required for the final reducer using this
> formula.
> 
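
As a rough illustration of the estimate above, the sketch size and a lower
bound on reducer memory can be computed as below. This is a back-of-the-envelope
sketch, not Mahout code: the function names, the mapper count and the
dimensionality are hypothetical, the natural logarithm is an assumption (the
thread does not state the base), and 8 bytes per value assumes dense vectors of
doubles with no object overhead. With k = 100 and n = 300,000, k * ln n comes to
about 1261, which matches the km value mentioned further down the thread.

```python
import math


def sketch_centroids(num_mappers, k, num_points):
    """Expected number of intermediate centroids: m * k * log n.

    The log base is an assumption (natural log); the thread does not say.
    """
    return num_mappers * k * math.log(num_points)


def reducer_bytes(num_mappers, k, num_points, dims, bytes_per_value=8):
    """Lower bound on reducer memory if every centroid is a dense vector.

    Ignores JVM object headers and any working memory of the final
    clustering step, so the real requirement will be higher.
    """
    return sketch_centroids(num_mappers, k, num_points) * dims * bytes_per_value


if __name__ == "__main__":
    # k and n are taken from the thread; m and dims are made-up examples.
    m, k, n, dims = 10, 100, 300_000, 10_000
    print("centroids in sketch:", round(sketch_centroids(m, k, n)))
    print("dense payload in MiB:", round(reducer_bytes(m, k, n, dims) / 2**20))
```

The estimate scales linearly with the number of mappers and with the vector
dimensionality, so both are worth checking before sizing the reducer heap.
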
>> 
>>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <s...@apache.org> wrote:
>>> 
>>> That the algorithm runs a single reducer is expected. The algorithm
>>> creates a sketch of the data in parallel in the map-phase, which is
>>> collected by the reducer afterwards. The reducer then applies an
>>> expensive in-memory clustering algorithm to the sketch.
>>> 
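
The map-then-reduce structure described above can be illustrated with a toy
sketch. This is not Mahout's implementation: the function names are
hypothetical, and a fixed distance cutoff stands in for whatever rule the real
algorithm uses to decide when a point starts a new centroid. The final
in-memory clustering step in the reducer is left out.

```python
import math


def nearest(centroids, point):
    """Return (index, distance) of the closest centroid, or (-1, inf) if none."""
    best_i, best_d = -1, math.inf
    for i, (vec, _weight) in enumerate(centroids):
        d = math.dist(vec, point)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def build_sketch(points, cutoff):
    """One 'mapper': a single pass that keeps a list of (centroid, weight)."""
    centroids = []
    for p in points:
        i, d = nearest(centroids, p)
        if i < 0 or d > cutoff:
            # Far from everything seen so far: start a new centroid.
            centroids.append((list(p), 1))
        else:
            # Close enough: fold the point into the nearest centroid
            # as a weighted mean.
            vec, w = centroids[i]
            merged = [(v * w + x) / (w + 1) for v, x in zip(vec, p)]
            centroids[i] = (merged, w + 1)
    return centroids


def collect_sketches(sketches):
    """The single 'reducer': gather every mapper's centroids in memory."""
    return [c for sketch in sketches for c in sketch]


if __name__ == "__main__":
    split_a = [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0)]
    split_b = [(10.0, 10.2), (0.0, 0.1)]
    combined = collect_sketches(
        [build_sketch(split_a, cutoff=1.0), build_sketch(split_b, cutoff=1.0)]
    )
    print(len(combined), "centroids held by the reducer")
```

Because every mapper's centroids end up in one list, the reducer's memory grows
with the number of mappers, which would be consistent with the single reducer
being where an OOM shows up.
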
>>> Which dataset are you using for testing? I can also do some tests on a
>>> cluster here.
>>> 
>>> I can imagine two possible causes for the problems: Maybe there's a
>>> problem with the vectors and some calculations take very long because
>>> the wrong access pattern or implementation is chosen.
>>> 
>>> Another problem could be that the mappers and reducers have too little
>>> memory and spend a lot of time running garbage collections.
>>> 
>>> --sebastian
>>> 
>>> 
>>> On 23.12.2013 22:14, Suneel Marthi wrote:
>>>> Has anyone been successful running Streaming KMeans clustering on a large
>>>> dataset (> 100,000 points)?
>>>> 
>>>> 
>>>> It just seems to take a very long time (> 4 hrs) for the mappers to
>>>> finish on about 300K data points. The reduce phase has only a single
>>>> reducer running, which throws an OOM and fails the job several hours
>>>> after the job has been kicked off.
>>>> 
>>>> It's the same story when trying to run in sequential mode.
>>>> 
>>>> Looking at the code, the bottleneck seems to be in
>>>> StreamingKMeans.clusterInternal(). Without understanding the behaviour of
>>>> the algorithm, I am not sure if the sequence of steps in there is correct.
>>>> 
>>>> 
>>>> There are a few calls that call themselves repeatedly, like
>>>> StreamingKMeans.clusterInternal() and Searcher.searchFirst().
>>>> 
>>>> We really need to have this working on datasets that are larger than the
>>>> 20K Reuters dataset.
>>>> 
>>>> I am trying to run this on 300K vectors with k = 100, km = 1261 and
>>>> FastProjectionSearch.
>
