I'm trying to automate something like hierarchical clustering, so I'm looking 
for a good quality metric. I don't see a way to automate anything from the 
numbers I just got, but it's a start. They came from a very small data set.

You mention looking at intra-cluster average distance with held-out data. 
Held-out, I assume, means the data was not used to calculate centroids or to 
determine cluster membership. Are you proposing remeasuring the average 
distance from the closest centroid for these held-out docs? Averaging together 
the ones that are closest to the same centroid, then averaging those 
per-cluster averages across all clusters?
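
If so, I picture something like this rough numpy sketch (Euclidean distance 
and dense in-memory vectors assumed; centroids and held_out are just 
placeholder names for the arrays):

import numpy as np

def held_out_avg_distance(centroids, held_out):
    """Average distance from held-out docs to their closest centroid:
    average within each cluster first, then average those averages."""
    # distance from every held-out doc to every centroid
    dists = np.linalg.norm(held_out[:, None, :] - centroids[None, :, :], axis=2)
    closest = dists.argmin(axis=1)   # nearest centroid for each held-out doc
    min_dist = dists.min(axis=1)     # distance to that nearest centroid

    # per-cluster averages, skipping clusters that got no held-out docs
    per_cluster = [min_dist[closest == c].mean()
                   for c in range(len(centroids))
                   if np.any(closest == c)]

    # average of the per-cluster averages
    return float(np.mean(per_cluster))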

I don't think I've heard of this before. Seems interesting. Is there a paper? 

On May 21, 2013, at 9:53 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

On Tue, May 21, 2013 at 8:47 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> For this sample it looks like about 20-40 clusters is "best"? Looking at
> the results for k=40 by eyeball, they do seem pretty good.


It is really hard to tell with these numbers. In spite of their heritage,
these scaled average distances are kind of debatable as things to compare,
if only because they are scaled differently.

My own preference is to use the unscaled intra-cluster average distance.
This should decrease monotonically as k increases. The interesting question
(for me) is what the same average looks like for held-out data.

This measure of quality is focused on the use of clustering as a feature for
downstream modeling, not necessarily for human consumption.
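
Concretely, the comparison I have in mind looks roughly like this (a sketch
using scikit-learn's KMeans and Euclidean distance as stand-ins for whatever
clusterer and distance measure you are actually running; the random data is
only a placeholder):

import numpy as np
from sklearn.cluster import KMeans

def avg_distance_to_nearest_centroid(centroids, vectors):
    # plain (unscaled) mean distance from each vector to its closest centroid
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 50))        # stand-in for your document vectors
train, held_out = docs[:800], docs[800:]  # held-out docs never see the clusterer

for k in (10, 20, 40, 80):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train)
    in_sample = avg_distance_to_nearest_centroid(km.cluster_centers_, train)
    out_sample = avg_distance_to_nearest_centroid(km.cluster_centers_, held_out)
    # in_sample should keep falling as k grows; out_sample flattens (or rises)
    # once extra clusters stop capturing real structure
    print(k, in_sample, out_sample)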
