OK, seems pretty simple.

Was there a paper attached?

On May 24, 2013, at 4:07 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Yes.  That is the idea.

But I would drop the average of averages.  Use squared distance.

Just average the squared distances to the nearest centroid over all of the
points (or over enough of them to get an estimate).

This is proportional to log-likelihood (with an offset) for the
mixture-of-Gaussians model that underlies k-means clustering.
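
Roughly, as a sketch under assumptions that are mine and not spelled out above: k spherical Gaussian components with a shared variance \sigma^2, equal mixing weights 1/k, d dimensions, and each point's likelihood dominated by its nearest centroid c_j (the hard-assignment approximation).

```latex
\log p(x_i) \approx -\log k - \frac{d}{2}\log(2\pi\sigma^2)
              - \frac{1}{2\sigma^2}\min_j \lVert x_i - c_j \rVert^2

\frac{1}{n}\sum_{i=1}^{n} \log p(x_i) \approx
    -\frac{1}{2\sigma^2}
    \underbrace{\frac{1}{n}\sum_{i=1}^{n}\min_j \lVert x_i - c_j \rVert^2}_{\text{mean squared distance}}
    - \log k - \frac{d}{2}\log(2\pi\sigma^2)
```

So under these assumptions the mean log-likelihood is the mean squared distance times a negative constant, plus an offset. Lower mean squared distance means higher likelihood.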

See this paper for a use of mean squared distance to nearest cluster.
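
A minimal sketch of that measurement in plain Python, for illustration only. The function names and the toy data are hypothetical, and the centroids are assumed to come from a clustering run that never saw the held-out points.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_squared_distance(points, centroids):
    """Average squared distance from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # Nearest centroid = the one with the smallest squared distance.
        total += min(squared_distance(p, c) for c in centroids)
    return total / len(points)


# Hypothetical toy data: centroids fitted elsewhere, held-out points
# that were not used to fit them.
centroids = [(0.0, 0.0), (10.0, 10.0)]
held_out = [(1.0, 0.0), (0.0, 2.0), (10.0, 11.0), (9.0, 10.0)]

score = mean_squared_distance(held_out, centroids)
print(score)  # 1.75
```

Computing the same number on the training points and on the held-out points for each k gives the two curves to compare. There is no per-cluster averaging step, so clusters with more points simply contribute more to the average.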


On Fri, May 24, 2013 at 9:46 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I'm trying to automate something like a hierarchical clustering and so
> looking for a good quality metric. I can see no way to automate from the
> numbers I just got but it's a start. It was for a very small data set.
> 
> You mention looking at intra-cluster average distance with held out data.
> Held-out, I assume, means it was not used to calculate centroids or in
> determining cluster membership. Are you proposing remeasuring the average
> distance from the closest centroid for these held-out docs? Averaging
> together the ones that are closest to the same centroid, then averaging the
> averages for all clusters?
> 
> I don't think I've heard of this before. Seems interesting; is there a
> paper?
> 
> On May 21, 2013, at 9:53 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
> On Tue, May 21, 2013 at 8:47 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> 
>> For this sample it looks like about 20-40 clusters is "best"? Looking at
>> the results for k=40 by eyeball they do seem pretty good.
> 
> 
> It is really hard to tell with these numbers.  In spite of their heritage,
> these scaled average distances are kind of debatable as things to compare,
> if only because they are scaled differently.
> 
> My own tendency is to prefer to use unscaled intra-cluster average
> distance.  This should monotonically decrease as k increases.  The
> interesting question (for me) is what the same average is for held-out
> data.
> 
> This measure of quality is focused around the use of clustering as a
> feature for downstream modeling, not necessarily for human consumption.
> 
> 
