Both of those make sense to me.
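
For concreteness, here is a rough Python sketch of how I read the two suggestions quoted below. The function names are made up for illustration and are not the Mahout code. I'm also assuming the usual streaming k-means rule, where a point starts a new centroid with probability min(1, distance / distanceCutoff), and I leave open whether the distance is squared.

```python
import random


def initial_distance_cutoff(num_clusters):
    """Starting cutoff of roughly 1 / numClusters, per the quoted suggestion."""
    return 1.0 / num_clusters


def new_cluster_probability(distance, distance_cutoff, weight=1.0):
    """Chance that a point becomes a new centroid rather than being merged.

    Scaling the distance / distanceCutoff ratio by the point's weight makes
    heavy points (already-collapsed clusters) less likely to be merged away.
    """
    return min(1.0, weight * distance / distance_cutoff)


def starts_new_cluster(distance, distance_cutoff, weight=1.0, rng=random):
    """Draw once from rng (hypothetical helper) to make the decision."""
    return rng.random() < new_cluster_probability(distance, distance_cutoff, weight)
```

With numClusters = 10 the cutoff starts at 0.1, so a unit-weight point at distance 0.05 from its nearest centroid has a 0.5 chance of starting a new cluster, while a point of weight 4 at the same distance always does.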

On 8 May 2013 16:45, Dan Filimon <dangeorge.fili...@gmail.com> wrote:

> Hi Ted!
>
> I recently talked to one of the authors of streaming k-means, Adam Meyerson,
> asking about the distance cutoff as I wasn't sure of a right value for
> this.
>
> He told me two things:
> - that we should multiply the distance / distanceCutoff ratio by the weight
>   of the point we're trying to cluster so as to avoid collapsing larger
>   clusters
> - the initial cutoff they use is 1 / numClusters basically
>
> As I tested the code on multiple well-known data sets, this got me thinking
> of removing the distanceCutoff altogether.
> It seems like just another parameter to get right with only limited real
> value of fiddling with it.
>
> Additionally, clusterOvershoot, the thing we're using to delay distance
> cutoff increases, also seems somewhat unnecessary. Why use it and get a lot
> more centroids than what we asked for?
>
> I want to post a final version for review, but I just wanted to mention
> these two things.
>
> It's not like they "hurt" really, they just don't seem to be helping too
> much and I'd rather have something that more closely matches the
> theoretical guarantees in the paper.
>
> What do you think?
>

-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.tw...@cs.ox.ac.uk | +447799647538