Hi Ted!

I recently talked to Adam Meyerson, one of the authors of streaming
k-means, and asked him about the distance cutoff, as I wasn't sure what a
good value for it would be.

He told me two things:
- we should multiply the distance / distanceCutoff ratio by the weight of
the point we're trying to cluster, so as to avoid collapsing larger
clusters
- the initial cutoff they use is basically 1 / numClusters
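
To make sure I'm reading his two points correctly, here is a rough sketch
of what I think they amount to. It's plain Python purely for illustration,
not our actual code: the function names are made up, and treating the
weighted ratio as the probability of starting a new cluster (capped at 1)
is my assumption about how the ratio gets used.

```python
import random


def initial_cutoff(num_clusters):
    """Initial distance cutoff: basically 1 / numClusters."""
    return 1.0 / num_clusters


def new_cluster_probability(distance, distance_cutoff, weight=1.0):
    """Weighted distance / distanceCutoff ratio, capped at 1.

    Assumption: the ratio is used as the probability that the point
    starts a new cluster instead of being merged into its nearest one.
    """
    return min(1.0, weight * distance / distance_cutoff)


def starts_new_cluster(distance, distance_cutoff, weight=1.0, rng=random):
    """Randomly decide whether the point becomes a new centroid."""
    p = new_cluster_probability(distance, distance_cutoff, weight)
    return rng.random() < p
```

With the weight in there, a heavy point (say, a centroid that already
stands in for many points) is much more likely to stay a cluster of its
own than a single point at the same distance would be, which I take to be
what he meant by not collapsing larger clusters.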

As I tested the code on multiple well-known data sets, this got me thinking
about removing the distanceCutoff altogether.
It seems like just another parameter to get right, with only limited real
value in fiddling with it.

Additionally, clusterOvershoot, the thing we're using to delay distance
cutoff increases, also seems somewhat unnecessary. Why use it and end up
with a lot more centroids than we asked for?

I want to post a final version for review, but I just wanted to mention
these two things.

It's not that they really "hurt"; they just don't seem to be helping much,
and I'd rather have something that more closely matches the theoretical
guarantees in the paper.

What do you think?
