OK, seems pretty simple. Was there a paper attached?
On May 24, 2013, at 4:07 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Yes, that is the idea. But I would drop the average of averages.

Use squared distance. Just average all (or enough to get an estimate) of the distances to the nearest centroid. This is proportional to log-likelihood (with an offset) for the mixture of Gaussians model that underlies k-means clustering.

See this paper for a use of mean squared distance to nearest cluster.

On Fri, May 24, 2013 at 9:46 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I'm trying to automate something like a hierarchical clustering and so
> looking for a good quality metric. I can see no way to automate from the
> numbers I just got but it's a start. It was for a very small data set.
>
> You mention looking at intra-cluster average distance with held out data.
> Held-out, I assume, means it was not used to calculate centroids or in
> determining cluster membership. Are you proposing remeasuring the average
> distance from the closest centroid for these held-out docs? Averaging
> together the ones that are closest to the same centroid, then averaging the
> averages for all clusters?
>
> I don't think I've heard of this before. Seems interesting; is there a
> paper?
>
> On May 21, 2013, at 9:53 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> On Tue, May 21, 2013 at 8:47 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>
>> For this sample it looks like about 20-40 clusters is "best"? Looking at
>> the results for k=40 by eyeball they do seem pretty good.
>
> It is really hard to tell with these numbers. In spite of their heritage,
> these scaled average distances are kind of debatable as things to compare,
> if only because they are scaled differently.
>
> My own tendency is to prefer to use unscaled intra-cluster average
> distance. This should monotonically decrease as k increases. The
> interesting question (for me) is what the same average is for held-out
> data.
>
> This measure of quality is focused around the use of clustering as a
> feature for downstream modeling, not necessarily for human consumption.
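
For reference, here is a minimal sketch of the measure discussed in this thread: one flat average, over held-out points, of the squared Euclidean distance to the nearest centroid, with no per-cluster averaging. It is plain Python rather than anything from Mahout. The function names and the toy data are hypothetical, and it assumes the centroids were fitted beforehand on data that excludes the held-out points.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_squared_distance_to_nearest(points: Sequence[Point],
                                     centroids: Sequence[Point]) -> float:
    """Average over all points of the squared distance to the closest centroid.

    This is a single average over points, not an average of per-cluster averages.
    """
    if not points or not centroids:
        raise ValueError("need at least one point and one centroid")
    total = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return total / len(points)


# Hypothetical toy data: centroids fitted elsewhere, and held-out points
# that played no part in fitting them.
centroids = [(0.0, 0.0), (10.0, 10.0)]
held_out = [(1.0, 0.0), (0.0, 2.0), (10.0, 13.0)]

score = mean_squared_distance_to_nearest(held_out, centroids)
print(score)
```

Computing the same number on the training points and on the held-out points for each candidate k gives the two curves being compared: the training value should keep falling as k grows, while the held-out value is the one of interest.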