Re: Streaming KMeans distance cutoff

Andy Twigg Wed, 08 May 2013 08:50:35 -0700

Both of those make sense to me.



On 8 May 2013 16:45, Dan Filimon <[email protected]> wrote:

> Hi Ted!
>
> I recently talked to Adam Meyerson, one of the authors of streaming k-means,
> and asked about the distance cutoff, as I wasn't sure what the right value
> for it is.
>
> He told me two things:
> - that we should multiply the distance / distanceCutoff ratio by the weight
> of the point we're trying to cluster so as to avoid collapsing larger
> clusters
> - that the initial cutoff they use is basically 1 / numClusters
>
> As I tested the code on multiple well-known data sets, this got me thinking
> of removing the distanceCutoff altogether.
> It seems like just another parameter to get right, with only limited real
> value in fiddling with it.
>
> Additionally, clusterOvershoot, the thing we're using to delay distance
> cutoff increases, also seems somewhat unnecessary. Why use it and get a lot
> more centroids than what we asked for?
>
> I want to post a final version for review, but I just wanted to mention
> these two things.
>
> It's not like they "hurt" really, they just don't seem to be helping too
> much and I'd rather have something that more closely matches the
> theoretical guarantees in the paper.
>
> What do you think?
>
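
For reference, here is a rough sketch of how I read the two suggestions quoted
above. It is not the actual implementation: the function names are made up for
illustration, and I am assuming the new-cluster probability is simply the
weighted distance / distanceCutoff ratio capped at 1.

```python
import random


def initial_cutoff(num_clusters):
    # Assumption: initial cutoff is "1 / numClusters basically", as quoted above.
    return 1.0 / num_clusters


def new_cluster_probability(distance, distance_cutoff, weight=1.0):
    # Hypothetical helper: scale the distance / distanceCutoff ratio by the
    # weight of the point being clustered, so that heavy points (for example,
    # already-merged centroids) are less likely to be collapsed into a neighbour.
    return min(1.0, weight * distance / distance_cutoff)


def starts_new_cluster(distance, distance_cutoff, weight=1.0, rng=random):
    # Returns True when the point should become its own centroid,
    # False when it should be merged into its nearest centroid.
    return rng.random() < new_cluster_probability(distance, distance_cutoff, weight)
```

With numClusters = 10 the cutoff is 0.1, so a unit-weight point at distance 0.05
from its nearest centroid would start a new cluster about half the time, while
a point of weight 4 at the same distance always would.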



-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
