On Wed, May 8, 2013 at 10:28 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:

> > > I think it avoids the need of the special way we handle the increase
> > > of distanceCutoff by beta in another if.
> >
> > Sure. Sounds right and all.
> >
> > But experiment will tell better.

Yes. But I definitely saw cases where the same cutoff caused the centroid
count to decrease. In my mind, continuing to increase the cutoff in those
cases is a bad thing. A smaller cutoff is more conservative in that it
will preserve more data in the sketch. Until we see it preserving too much
data, we don't need to increase the cutoff.

> > > ... They actually call it a "facility cost" rather than a distance,
> > > probably for this reason.

Btw... the reason that they call it a facility cost is because they are
referring to a different literature. With k-means, k is traditionally
fixed. With facility assignment, it is traditionally not. The problems are
otherwise quite similar. The reason for the difference in nomenclature is
because the facility assignment stuff comes from operations research, not
computer science.

> ... I'm uncomfortable with the distanceCutoff growing too high, but I'll
> just put the blame on that one on the data.

I am uncomfortable as well. This is one reason I would like to only
increase the distanceCutoff when a small value proves ineffective.

> StreamingKMeans + BallKMeans gave good results compared to Mahout KMeans
> on other data sets (similar kinds of clusters and good looking Dunn and
> Davies-Bouldin indices).

You hide this gem in a long email!!! Good news.

> > > The estimate we give it at the beginning is only valid as long as not
> > > enough datapoints have been processed to go over k log n.
> >
> > Are we talking about clusterOvershoot here? Or the numClusters
> > over-ride?
>
> We collapse the clusters when the number of actual centroids is over
> clusterOvershoot * numClusters.
> I'm thinking that since numClusters increases anyway, clusterOvershoot
> means we end up with more clusters than we need (not bad per se, but
> trying to get rid of variables).

I view numClusters as the minimum number of clusters that we want to see.
clusterOvershoot says that we can go a ways above the minimum, but we
hopefully will just collapse back down to the minimum or above.

> > Well, we have seen cases where the over-shoot needed to be >1. Those
> > may have gone away with better adaptation, but I think that they
> > probably still can happen.
>
> Sorry, what do you mean by adaptation here?

Better adjustment and use of the distanceCutoff. This should make the
collapse in the recursive clustering be less dramatic and more
predictable. That will make the system require less over-shoot.
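
To make the cutoff policy discussed in this thread concrete, here is a
minimal Python sketch. It is not the Mahout code. The function names
(merge_pass, collapse, needs_collapse), the beta default, and max_tries are
all hypothetical, and the merge rule is a deterministic stand-in for
whatever rule the real sketch uses. "Ineffective" is assumed to mean that a
pass at the current cutoff did not reduce the centroid count.

```python
import math


def merge_pass(centroids, cutoff):
    """One greedy pass over (point, weight) pairs: fold each centroid into
    the nearest kept centroid if it lies within `cutoff`, else keep it.
    (Deterministic simplification, assumed for illustration only.)"""
    kept = []
    for point, weight in centroids:
        best_i, best_d = -1, math.inf
        for i, (q, _) in enumerate(kept):
            d = math.dist(point, q)
            if d < best_d:
                best_i, best_d = i, d
        if best_i >= 0 and best_d <= cutoff:
            q, w = kept[best_i]
            total = w + weight
            # Weighted mean of the two centroids.
            merged = tuple((a * w + b * weight) / total
                           for a, b in zip(q, point))
            kept[best_i] = (merged, total)
        else:
            kept.append((tuple(point), weight))
    return kept


def needs_collapse(num_centroids, num_clusters, cluster_overshoot):
    """Collapse is triggered once the sketch grows past
    cluster_overshoot * num_clusters centroids."""
    return num_centroids > cluster_overshoot * num_clusters


def collapse(centroids, cutoff, beta=1.3, max_tries=50):
    """Try the current cutoff first; multiply it by `beta` only after a
    pass fails to reduce the centroid count. `max_tries` is a hypothetical
    guard against looping forever."""
    for _ in range(max_tries):
        collapsed = merge_pass(centroids, cutoff)
        if len(collapsed) < len(centroids):
            return collapsed, cutoff  # cutoff was effective, leave it alone
        cutoff *= beta                # cutoff was ineffective, grow it
    return centroids, cutoff
```

The point of the design is that a collapse which succeeds at the current
cutoff returns that cutoff unchanged, so the cutoff only grows when a
smaller value has actually been tried and failed.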