I'm trying to automate something like hierarchical clustering, so I'm looking for a good quality metric. I can see no way to automate it from the numbers I just got, but it's a start. They were for a very small data set.
You mention looking at intra-cluster average distance with held-out data. Held-out, I assume, means the data was not used to calculate centroids or to determine cluster membership. Are you proposing to remeasure the average distance from the closest centroid for these held-out docs, averaging together the ones that are closest to the same centroid, and then averaging those averages across all clusters? I don't think I've heard of this before. It seems interesting; is there a paper?

On May 21, 2013, at 9:53 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

On Tue, May 21, 2013 at 8:47 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> For this sample it looks like about 20-40 clusters is "best"? Looking at
> the results for k=40 by eyeball they do seem pretty good.

It is really hard to tell with these numbers. In spite of their heritage, these scaled average distances are kind of debatable as things to compare, if only because they are scaled differently.

My own tendency is to prefer to use unscaled intra-cluster average distance. This should monotonically decrease as k increases. The interesting question (for me) is what the same average is for held-out data.

This measure of quality is focused around the use of clustering as a feature for downstream modeling, not necessarily for human consumption.
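
To make the two readings of the held-out measure concrete, here is a minimal sketch. It is an illustration only, not what Ted or Mahout actually computes. It assumes Euclidean distance, centroids fitted on training data only, and hypothetical function names and toy data. `average_distance` is the plain unscaled mean over all held-out points; `average_of_cluster_averages` is the "average the averages" variant asked about above.

```python
import math
from collections import defaultdict


def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def average_distance(points, centroids):
    """Unscaled mean distance from each point to its nearest centroid."""
    total = sum(nearest_centroid(p, centroids)[1] for p in points)
    return total / len(points)


def average_of_cluster_averages(points, centroids):
    """Mean distance within each cluster, then the unweighted mean of those means."""
    by_cluster = defaultdict(list)
    for p in points:
        index, distance = nearest_centroid(p, centroids)
        by_cluster[index].append(distance)
    means = [sum(d) / len(d) for d in by_cluster.values()]
    return sum(means) / len(means)


# Hypothetical toy data. The centroids are assumed to come from clustering
# the training set only; the held-out points played no part in fitting them.
centroids = [(0.0, 0.0), (10.0, 0.0)]
held_out = [(1.0, 0.0), (0.0, 3.0), (10.0, 4.0)]

overall = average_distance(held_out, centroids)
per_cluster = average_of_cluster_averages(held_out, centroids)
print(overall, per_cluster)
```

The two numbers differ whenever cluster sizes differ, because the second variant gives a small cluster the same weight as a large one. Repeating the calculation for several values of k on both training and held-out data would give the comparison described in the quoted message.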