Hi Ted! I recently talked to Adam Meyerson, one of the authors of streaming k-means, to ask about the distance cutoff, as I wasn't sure of the right value for it.
He told me two things:

- we should multiply the distance / distanceCutoff ratio by the weight of the point we're trying to cluster, so as to avoid collapsing larger clusters;
- the initial cutoff they use is basically 1 / numClusters.

(There's a rough sketch of the rule I mean at the bottom of this mail.)

While testing the code on multiple well-known data sets, this got me thinking about removing the distanceCutoff altogether. It seems like just another parameter to get right, with only limited real value in fiddling with it. Additionally, clusterOvershoot, the factor we use to delay distance cutoff increases, also seems somewhat unnecessary: why use it and end up with far more centroids than we asked for?

I want to post a final version for review, but I just wanted to mention these two things first. It's not that they really "hurt"; they just don't seem to help much, and I'd rather have something that more closely matches the theoretical guarantees in the paper. What do you think?
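P.S. In case it's useful, here's the sketch. This is plain Java with a made-up Centroid pair rather than our actual Mahout classes, Euclidean distance hard-coded, and the cutoff-growing / centroid-collapsing step omitted, so it only shows the weighted facility-creation rule and the 1 / numClusters initial cutoff:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class StreamingStepSketch {
      static class Centroid {
        double[] center;
        double weight;
        Centroid(double[] center, double weight) {
          this.center = center.clone();
          this.weight = weight;
        }
      }

      final List<Centroid> centroids = new ArrayList<>();
      final Random random = new Random();
      double distanceCutoff;

      StreamingStepSketch(int numClusters) {
        // Adam's suggestion: start the cutoff at 1 / numClusters.
        this.distanceCutoff = 1.0 / numClusters;
      }

      void cluster(double[] point, double weight) {
        if (centroids.isEmpty()) {
          centroids.add(new Centroid(point, weight));
          return;
        }
        // Find the nearest existing centroid.
        Centroid nearest = null;
        double minDistance = Double.POSITIVE_INFINITY;
        for (Centroid c : centroids) {
          double d = euclidean(point, c.center);
          if (d < minDistance) { minDistance = d; nearest = c; }
        }
        // The weighted ratio: heavier points are proportionally more likely
        // to open their own cluster, which is what keeps large clusters from
        // being collapsed into a single centroid. A value >= 1 always opens
        // a new cluster.
        double p = weight * minDistance / distanceCutoff;
        if (random.nextDouble() < p) {
          centroids.add(new Centroid(point, weight));
        } else {
          // Otherwise fold the point into the nearest centroid
          // as a weighted average.
          double total = nearest.weight + weight;
          for (int i = 0; i < nearest.center.length; i++) {
            nearest.center[i] =
                (nearest.center[i] * nearest.weight + point[i] * weight) / total;
          }
          nearest.weight = total;
        }
      }

      static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
          double d = a[i] - b[i];
          sum += d * d;
        }
        return Math.sqrt(sum);
      }
    }

The point of the sketch is that once the creation probability is weight * distance / distanceCutoff and the cutoff starts at 1 / numClusters, there's no obvious knob left for the user to tune, which is why I'd drop the parameter.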
