Hi Dan,

Sure. I took a quick look just now and it looks good. Did you find that shuffling before collapsing helped, and is that why you kept it in? It didn't make much difference for me.
Andy

On 9 May 2013 16:05, Dan Filimon <[email protected]> wrote:
> Andy, would you like to review the final version of the clustering code
> before it goes in [1]?
> [1] https://reviews.apache.org/r/10194/
>
> Ted, it's pretty much done. Okay it and I'll commit.
>
> On Wed, May 8, 2013 at 11:57 PM, Ted Dunning <[email protected]> wrote:
>
>> On Wed, May 8, 2013 at 10:28 AM, Dan Filimon <[email protected]> wrote:
>>
>> > > > I think it avoids the need of the special way we handle the increase of
>> > > > distanceCutoff by beta in another if.
>> > >
>> > > Sure. Sounds right and all.
>> > >
>> > > But experiment will tell better.
>>
>> yes.
>>
>> But I definitely saw cases where the same cutoff caused the centroid count
>> to decrease. In my mind, continuing to increase the cutoff in those cases
>> is a bad thing. A smaller cutoff is more conservative in that it will
>> preserve more data in the sketch. Until we see it preserving too much
>> data, we don't need to increase the cutoff.
>
> I kept the overshoot just to be safe in the CL.
>
>> > > > ... They
>> > > > actually call it a "facility cost" rather than a distance, probably for
>> > > > this reason.
>>
>> Btw... the reason that they call it a facility cost is because they are
>> referring to a different literature. With k-means, k is traditionally
>> fixed. With facility assignment, it is traditionally not. The problems
>> are otherwise quite similar. The reason for the difference in nomenclature
>> is because the facility assignment stuff comes from operations research,
>> not computer science.
>
> Ah, well that explains it. :)
>
>> > ... I'm uncomfortable with the distanceCutoff growing too high, but I'll just
>> > put the blame on that one on the data.
>>
>> I am uncomfortable as well.
>>
>> This is one reason I would like to only increase the distanceCutoff when a
>> small value proves ineffective.
>
> Alright, this is the version that's going in.
>
>> > StreamingKMeans + BallKMeans gave good results compared to Mahout KMeans on
>> > other data sets (similar kinds of clusters and good looking Dunn and
>> > Davies-Bouldin indices).
>>
>> You hide this gem in a long email!!!
>>
>> Good news.
>
> Yeah. :)
> It's comparable to Mahout KMeans quality wise, and very tweakable.
> The speed improvements should be apparent on large data sets that we run
> on Hadoop.
>
>> > > > The estimate we give it at the beginning is only valid as long as not
>> > > > enough datapoints have been processed to go over k log n.
>> > >
>> > > Are we talking about clusterOvershoot here? Or the numClusters over-ride?
>> >
>> > We collapse the clusters when the number of actual centroids is over
>> > clusterOvershoot * numClusters.
>> > I'm thinking that since numClusters increases anyway, clusterOvershoot
>> > means we end up with more clusters than we need (not bad per se, but trying
>> > to get rid of variables).
>>
>> I view it as numClusters is the minimum number of clusters that we want to
>> see. ClusterOverShoot says that we can go a ways above the minimum, but we
>> hopefully will just collapse back down to the minimum or above.
>>
>> > > Well, we have seen cases where the over-shoot needed to be >1. Those may
>> > > have gone away with better adaptation, but I think that they probably still
>> > > can happen.
>> >
>> > Sorry, what do you mean by adaptation here?
>>
>> Better adjustment and use of the distanceCutoff. This should make the
>> collapse in the recursive clustering be less dramatic and more predictable.
>> That will make the system require less over-shoot.

-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
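For reference, the two rules discussed in the quoted thread (collapse once the centroid count exceeds clusterOvershoot * numClusters, and only raise distanceCutoff when the current value proves ineffective) can be sketched roughly as below. This is a minimal Python illustration, not the Mahout code under review: the function names are hypothetical, and it assumes that "increase by beta" means multiplying the cutoff by beta.

```python
def should_collapse(num_centroids: int, num_clusters: int,
                    cluster_overshoot: float) -> bool:
    """Return True once the sketch holds more centroids than
    clusterOvershoot * numClusters (the trigger described in the thread)."""
    return num_centroids > cluster_overshoot * num_clusters


def next_cutoff(distance_cutoff: float, count_before: int,
                count_after: int, beta: float) -> float:
    """Pick the distanceCutoff to use after a collapse pass.

    Assumption: the increase is multiplicative (cutoff * beta).
    """
    if count_after < count_before:
        # Collapsing at the current cutoff already reduced the centroid
        # count, so keep the smaller, more conservative cutoff.
        return distance_cutoff
    # The current cutoff proved ineffective, so grow it.
    return distance_cutoff * beta


if __name__ == "__main__":
    # Hypothetical numbers purely for illustration.
    print(should_collapse(25, 10, 2.0))
    print(next_cutoff(1.0, 25, 12, 1.5))
    print(next_cutoff(1.0, 25, 25, 1.5))
```

Keeping the cutoff whenever a collapse succeeds preserves more data in the sketch, which is the conservative behaviour argued for in the thread; the overshoot factor then only bounds how far above numClusters the sketch may grow between collapses.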
