Hi Johannes, can you share some details about the dataset that you ran streaming k-means on (number of datapoints, cardinality, etc.)?
@Ted/Suneel Shouldn't the approximate searching techniques (e.g. projection search) help cope with high-dimensional inputs?

--sebastian

On 25.12.2013 10:42, Johannes Schulte wrote:
> Hi,
>
> I also had problems getting up to speed, but I attributed that to the
> cardinality of the vectors. I didn't do the math exactly, but while
> streaming k-means improves over regular k-means in using log(k) and
> (number of datapoints / k) passes, the dimension parameter d from the
> original k*d*n stays untouched, right?
>
> What is your vectors' cardinality?
>
>
> On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <[email protected]> wrote:
>
>> Ted,
>>
>> What were the CLI parameters when you ran this test for 1M points - no. of
>> clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?
>>
>>
>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <[email protected]>
>> wrote:
>>
>> For reference, on a 16-core machine, I was able to run the sequential
>> version of streaming k-means on 1,000,000 points, each with 10 dimensions,
>> in about 20 seconds. The map-reduce versions are comparable subject to
>> scaling, except for startup time.
>>
>>
>> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <[email protected]>
>> wrote:
>>
>>> That the algorithm runs a single reducer is expected. The algorithm
>>> creates a sketch of the data in parallel in the map phase, which is
>>> collected by the reducer afterwards. The reducer then applies an
>>> expensive in-memory clustering algorithm to the sketch.
>>>
>>> Which dataset are you using for testing? I can also do some tests on a
>>> cluster here.
>>>
>>> I can imagine two possible causes for the problems: Maybe there's a
>>> problem with the vectors and some calculations take very long because
>>> the wrong access pattern or implementation is chosen.
>>>
>>> Another problem could be that the mappers and reducers have too little
>>> memory and spend a lot of time running garbage collections.
>>>
>>> --sebastian
>>>
>>>
>>> On 23.12.2013 22:14, Suneel Marthi wrote:
>>>> Has anyone been successful running Streaming KMeans clustering on a
>>>> large dataset (> 100,000 points)?
>>>>
>>>> It just seems to take a very long time (> 4hrs) for the mappers to
>>>> finish on about 300K data points, and the reduce phase has only a single
>>>> reducer running, which throws an OOM and fails the job several hours
>>>> after it has been kicked off.
>>>>
>>>> It's the same story when trying to run in sequential mode.
>>>>
>>>> Looking at the code, the bottleneck seems to be in
>>>> StreamingKMeans.clusterInternal(); without understanding the behaviour
>>>> of the algorithm I am not sure if the sequence of steps in there is
>>>> correct.
>>>>
>>>> There are a few methods that are called repeatedly over and over again,
>>>> like StreamingKMeans.clusterInternal() and Searcher.searchFirst().
>>>>
>>>> We really need to have this working on datasets that are larger than the
>>>> 20K Reuters dataset.
>>>>
>>>> I am trying to run this on 300K vectors with k = 100, km = 1261 and
>>>> FastProjectSearch.
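
To make the cost discussion above concrete, here is a simplified, brute-force
sketch of the map-side pass as I understand it. This is not the actual Mahout
code; class names, the cutoff-scaling constant and the merge rule are made up
for illustration only.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /**
     * Simplified, brute-force sketch of the streaming pass the mappers do:
     * one pass over the data, keeping at most ~km weighted centroids.
     * Every incoming point costs one nearest-centroid search, which is
     * where the dimension d enters the runtime.
     */
    public class StreamingSketch {

      static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
          double diff = a[i] - b[i];
          sum += diff * diff;
        }
        return sum;
      }

      public static List<double[]> sketch(Iterable<double[]> points,
                                          int maxSketchSize,     // km
                                          double distanceCutoff, // estimatedDistanceCutoff
                                          Random rng) {
        List<double[]> centroids = new ArrayList<double[]>();
        List<Double> weights = new ArrayList<Double>();

        for (double[] point : points) {
          if (centroids.isEmpty()) {
            centroids.add(point.clone());
            weights.add(1.0);
            continue;
          }

          // Nearest-centroid search: O(|centroids| * d) when done brute force.
          // This is the step an approximate searcher (projection search, LSH)
          // is meant to speed up.
          int nearest = 0;
          double best = Double.POSITIVE_INFINITY;
          for (int i = 0; i < centroids.size(); i++) {
            double dist = squaredDistance(point, centroids.get(i));
            if (dist < best) {
              best = dist;
              nearest = i;
            }
          }

          // Far-away points become new centroids with probability proportional
          // to their distance; nearby points are folded into the closest one.
          if (rng.nextDouble() < Math.min(best / distanceCutoff, 1.0)) {
            centroids.add(point.clone());
            weights.add(1.0);
          } else {
            double[] c = centroids.get(nearest);
            double w = weights.get(nearest);
            for (int i = 0; i < c.length; i++) {
              c[i] = (c[i] * w + point[i]) / (w + 1);
            }
            weights.set(nearest, w + 1);
          }

          // If the sketch grows past its budget, relax the cutoff so future
          // points are merged more aggressively. (The real implementation also
          // re-collapses the existing centroids here, which is the repeated
          // clusterInternal() work showing up in the profiles.)
          if (centroids.size() > maxSketchSize) {
            distanceCutoff *= 1.5;
          }
        }
        return centroids;
      }
    }

The inner nearest-centroid loop is roughly what Searcher.searchFirst() stands
in for in the real implementation, and swapping the brute-force scan for a
projection search is exactly where high dimensionality should become easier
to cope with - although each individual distance computation still touches
all d components, so d never disappears from the cost entirely.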

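Independent of the algorithmic cost, if the single reducer is OOMing or
thrashing in GC while it clusters the collected sketch, it may be worth
raising the per-task heap before resubmitting the job. A minimal,
hypothetical snippet (the property names below are the Hadoop 1.x ones;
on Hadoop 2 they are mapreduce.map.java.opts / mapreduce.reduce.java.opts):

    import org.apache.hadoop.conf.Configuration;

    public class JobHeapSettings {
      public static void main(String[] args) {
        // Give the map and reduce tasks more heap before driving the
        // streaming k-means job with this configuration.
        Configuration conf = new Configuration();
        conf.set("mapred.map.child.java.opts", "-Xmx2048m");
        conf.set("mapred.reduce.child.java.opts", "-Xmx4096m");
        // ... pass conf to the job submission.
      }
    }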