Re: Streaming KMeans distance cutoff

Ted Dunning Wed, 08 May 2013 09:49:48 -0700

Inline


On Wed, May 8, 2013 at 8:49 AM, Andy Twigg <[email protected]> wrote:

> both of those make sense to me.
>
>
>
> On 8 May 2013 16:45, Dan Filimon <[email protected]> wrote:
>
>> Hi Ted!
>>
>> I recently talked to one of the authors of streaming k-means, Adam
>> Meyerson
>> asking about the distance cutoff as I wasn't sure of a right value for
>> this.
>>
>> He told me two things:
>> - that we should multiply the distance / distanceCutoff ratio by the
>> weight
>> of the point we're trying to cluster so as to avoid collapsing larger
>> clusters
>>
>
This makes sense, but I have no idea what the effect will be.


>  - the initial cutoff they use is 1 / numClusters basically
>>
>
This is just wrong.

It is fine in theoretical settings with synthetic clusters of unit
characteristic scale, but is not invariant over uniform scaling so it can't
be correct.




>
>> As I tested the code on multiple well known data sets, this got me
>> thinking
>> of removing the distanceCutoff all together.
>> It seems like just another parameter to get right with only limited real
>> value of fiddling with it.
>>
>
I think it is core to the algorithm.  It is adapted to the data in any
case.


>> Additionally, clusterOvershoot, the thing we're using to delay distance
>> cutoff increases also seems somewhat unnecessary. Why use it and get a lot
>> more centroids than what we asked for.
>>
>
Well, it needs to be there until we get more run-time.  Getting more
centroids doesn't hurt and getting too few definitely does hurt.  Thus we
need some insurance.


>
>> I want to post a final version for review, but I just wanted to mention
>> these two things.
>>
>> It's not like they "hurt" really, they just don't seem to be helping too
>> much and I'd rather have something that more closely matches the
>> theoretical guarantees in the paper.
>>
>> What do you think?
>>
>
I think we already do better than some of the guarantees in the paper.

Re: Streaming KMeans distance cutoff

Reply via email to