Inline
On Wed, May 8, 2013 at 8:49 AM, Andy Twigg <[email protected]> wrote: > both of those make sense to me. > > > > On 8 May 2013 16:45, Dan Filimon <[email protected]> wrote: > >> Hi Ted! >> >> I recently talked to one of the authors of streaming k-means, Adam >> Meyerson >> asking about the distance cutoff as I wasn't sure of a right value for >> this. >> >> He told me two things: >> - that we should multiply the distance / distanceCutoff ratio by the >> weight >> of the point we're trying to cluster so as to avoid collapsing larger >> clusters >> > This makes sense, but I have no idea what the effect will be. > - the initial cutoff they use is 1 / numClusters basically >> > This is just wrong. It is fine in theoretical settings with synthetic clusters of unit characteristic scale, but is not invariant over uniform scaling so it can't be correct. > >> As I tested the code on multiple well known data sets, this got me >> thinking >> of removing the distanceCutoff all together. >> It seems like just another parameter to get right with only limited real >> value of fiddling with it. >> > I think it is core to the algorithm. It is adapted to the data in any case. >> Additionally, clusterOvershoot, the thing we're using to delay distance >> cutoff increases also seems somewhat unnecessary. Why use it and get a lot >> more centroids than what we asked for. >> > Well, it needs to be there until we get more run-time. Getting more centroids doesn't hurt and getting too few definitely does hurt. Thus we need some insurance. > >> I want to post a final version for review, but I just wanted to mention >> these two things. >> >> It's not like they "hurt" really, they just don't seem to be helping too >> much and I'd rather have something that more closely matches the >> theoretical guarantees in the paper. >> >> What do you think? >> > I think we already do better than some of the guarantees in the paper.
