On Wed, May 8, 2013 at 10:00 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> On Wed, May 8, 2013 at 7:48 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Inline
> >
> >> He told me two things:
> >> - that we should multiply the distance / distanceCutoff ratio by the
> >> weight of the point we're trying to cluster so as to avoid collapsing
> >> larger clusters
> >
> > This makes sense, but I have no idea what the effect will be.
>
> I think it avoids the need for the special way we handle the increase of
> distanceCutoff by beta in another if.

Sure. Sounds right and all. But experiment will tell better.

> >> - the initial cutoff they use is 1 / numClusters basically
> >
> > This is just wrong.
> >
> > It is fine in theoretical settings with synthetic clusters of unit
> > characteristic scale, but is not invariant over uniform scaling so it
> > can't be correct.
>
> But the thing is, it's not really a distance at all. I've seen it increase
> to far beyond what a maximum distance between clusters should be.
> So, for example, in a case where the maximum distance is 2, it went all
> the way up to 14. :)
>
> It's only a distance conceptually; it's more like a "factor". They
> actually call it a "facility cost" rather than a distance, probably for
> this reason.

It has units of distance and thus should be invariant with respect to
uniform rescaling. No way around that. That means that it has to be based
on a measurement from the data.

And since it is a probabilistic bound, it is OK if it exceeds other
distances. 14 versus 2 does seem a bit high, but conceptually exceeding
the distance is not a problem. I think that the 14 versus 2 problem is
something else.

> >> Additionally, clusterOvershoot, the thing we're using to delay
> >> distanceCutoff increases, also seems somewhat unnecessary. Why use it
> >> and get a lot more centroids than what we asked for?
> >
> > Well, it needs to be there until we get more run-time. Getting more
> > centroids doesn't hurt and getting too few definitely does hurt. Thus
> > we need some insurance.
>
> Well, it would get as many clusters as it needs to. And we increase
> numClusters as we go until it's clusterLogFactor * log(numProcessedDatapoints).
> I understand the point that we should get more rather than fewer
> centroids, but k log n (what we're getting at the end) seems fine.

The key is not the number at the end. It is that the number never drops
below enough centroids.

> The estimate we give it at the beginning is only valid as long as not
> enough data points have been processed to go over k log n.

Are we talking about clusterOvershoot here? Or the numClusters override?

> > I think we already do better than some of the guarantees in the paper.
>
> I think we do the same, and I'm wondering whether the extra knobs do
> anything or not. I tried fiddling with them a bit and they didn't seem
> to do much.
>
> I also haven't worked through the math for our particular version.

Well, we have seen cases where the overshoot needed to be > 1. Those may
have gone away with better adaptation, but I think that they probably
still can happen.

The distanceCutoff initialization is blatantly important if you try
scaling the test data by 1e-9. Without adapting to the data, you will get
exactly one cluster.
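
To pin down the first point, here is a rough sketch of where the weight
multiplier would go. This is not the StreamingKMeans code, just a toy
version (plain double arrays, Euclidean distance, a simplified collapse
test) to make the proposal concrete:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy sketch of one streaming pass, NOT the real StreamingKMeans code.
// It only shows where the weight multiplier from the first point would go.
class WeightedStreamingSketch {
  private final Random rand = new Random();
  private final List<double[]> centers = new ArrayList<double[]>();
  private final List<Double> weights = new ArrayList<Double>();
  private double distanceCutoff;   // the "facility cost"; assumed already seeded from the data

  WeightedStreamingSketch(double initialCutoff) {
    this.distanceCutoff = initialCutoff;
  }

  void add(double[] point, double pointWeight) {
    if (centers.isEmpty()) {
      centers.add(point.clone());
      weights.add(pointWeight);
      return;
    }
    int nearest = nearestIndex(point);
    double d = euclidean(point, centers.get(nearest));
    // The proposed change: multiply the d / distanceCutoff ratio by the weight
    // of the incoming point, so heavy points (already-collapsed sub-clusters)
    // are more likely to survive as their own centroid instead of being folded in.
    if (rand.nextDouble() < pointWeight * d / distanceCutoff) {
      centers.add(point.clone());
      weights.add(pointWeight);
    } else {
      // Fold the point into the nearest centroid as a weighted average.
      double[] c = centers.get(nearest);
      double w = weights.get(nearest);
      for (int i = 0; i < c.length; i++) {
        c[i] = (c[i] * w + point[i] * pointWeight) / (w + pointWeight);
      }
      weights.set(nearest, w + pointWeight);
    }
  }

  private int nearestIndex(double[] point) {
    int best = 0;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centers.size(); i++) {
      double d = euclidean(point, centers.get(i));
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return best;
  }

  private static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}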
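
And on the knobs, this is how I read the interaction between numClusters,
clusterLogFactor and clusterOvershoot. Again just a sketch of my
understanding, not the real update rule, so take the exact arithmetic with
a grain of salt:

// Sketch of how the knobs interact as I understand them; the exact update in
// the real code may differ. clusterOvershoot only pads the target, delaying the
// moment at which we decide we have "too many" centroids and raise
// distanceCutoff (by beta) and re-collapse. That padding is the insurance
// against ever dropping below enough centroids.
class KnobSketch {
  int numClusters;                    // current target, grows toward roughly k log n
  final double clusterLogFactor;      // the k-ish constant in k log n
  final double clusterOvershoot;      // slack multiplier on the target

  KnobSketch(int initialNumClusters, double clusterLogFactor, double clusterOvershoot) {
    this.numClusters = initialNumClusters;
    this.clusterLogFactor = clusterLogFactor;
    this.clusterOvershoot = clusterOvershoot;
  }

  // Called as points are processed; numProcessed is the count seen so far.
  void updateTarget(long numProcessed) {
    int kLogN = (int) (clusterLogFactor * Math.log(numProcessed));
    numClusters = Math.max(numClusters, kLogN);
  }

  // We only shrink the centroid set (and grow distanceCutoff) once we are past
  // the padded target, never before.
  boolean tooManyCentroids(int currentCentroidCount) {
    return currentCentroidCount > clusterOvershoot * numClusters;
  }
}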
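
Finally, the 1e-9 experiment is easy to see with a data-derived seed for
distanceCutoff: something as crude as half the smallest distance among the
first few points rescales with the data, which a constant 1 / numClusters
never will. Illustration only, not a proposal for the exact seeding rule:

import java.util.ArrayList;
import java.util.List;

// Illustration only (not the Mahout code): seed distanceCutoff from the data so
// that uniformly rescaling the input rescales the cutoff with it.
class CutoffSeedSketch {
  static double initialCutoff(List<double[]> firstFewPoints) {
    double smallest = Double.POSITIVE_INFINITY;
    for (int i = 0; i < firstFewPoints.size(); i++) {
      for (int j = i + 1; j < firstFewPoints.size(); j++) {
        double d = euclidean(firstFewPoints.get(i), firstFewPoints.get(j));
        if (d > 0 && d < smallest) {
          smallest = d;
        }
      }
    }
    return smallest / 2;   // any modest fraction of an observed distance will do as a seed
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    List<double[]> sample = new ArrayList<double[]>();
    sample.add(new double[] {0, 0});
    sample.add(new double[] {1, 0});
    sample.add(new double[] {0, 3});

    // Scale the same data by 1e-9: the seeded cutoff scales with it, whereas a
    // fixed 1 / numClusters cutoff would dwarf every distance and collapse
    // everything into a single cluster.
    List<double[]> scaled = new ArrayList<double[]>();
    for (double[] p : sample) {
      scaled.add(new double[] {p[0] * 1e-9, p[1] * 1e-9});
    }
    System.out.println(initialCutoff(sample));   // about 0.5
    System.out.println(initialCutoff(scaled));   // about 5e-10, i.e. scaled by the same 1e-9
  }
}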