Re: Validating clustering output

Grant Ingersoll Wed, 17 Jun 2009 06:33:32 -0700


On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:

A principled approach to cluster evaluation is to measure how well the
cluster membership captures the structure of unseen data. A natural
measure for this is to measure how much of the entropy of the data is
captured by cluster membership. For k-means and its natural L_2 metric,
the natural cluster quality metric is the squared distance from the
nearest centroid adjusted by the log_2 of the number of clusters. This
can be compared to the squared magnitude of the original data or the
squared deviation from the centroid for all of the data. The idea is
that you are changing the representation of the data by allocating some
of the bits in your original representation to represent which cluster
each point is in. If those bits aren't made up by the residue being
small then your clustering is making a bad trade-off.
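
One possible way to turn the trade-off above into a number, as a rough sketch only: assume the residual around each centroid is coded with an isotropic Gaussian, so that shrinking the mean squared residual from the global-centroid baseline saves about (d/2) * log_2(baseline / within) bits per point, while naming the cluster costs log_2(k) bits. The Gaussian coding model and the function names are assumptions made for illustration; they are not from the thread or from Mahout.

```python
import math


def nearest_sq_dist(point, centroids):
    """Squared L_2 distance from point to its nearest centroid."""
    return min(
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    )


def mean_sq_residual(points, centroids):
    """Mean squared distance to the nearest centroid over all points."""
    return sum(nearest_sq_dist(p, centroids) for p in points) / len(points)


def net_bits_per_point(held_out, centroids):
    """Bits saved by the smaller residual minus log_2(k) bits of cluster id.

    Assumption: residuals are coded with an isotropic Gaussian, so the
    saving is (d/2) * log2(baseline / within). Positive means the
    clustering pays for the bits spent on cluster membership.
    """
    dim = len(held_out[0])
    # Baseline: squared deviation from the centroid of all the data.
    global_centroid = [sum(col) / len(held_out) for col in zip(*held_out)]
    baseline = mean_sq_residual(held_out, [global_centroid])
    within = mean_sq_residual(held_out, centroids)
    if baseline <= 0 or within <= 0:
        raise ValueError("zero residual: the Gaussian coding sketch breaks down")
    saved = 0.5 * dim * math.log2(baseline / within)
    cost = math.log2(len(centroids))
    return saved - cost
```

The points passed in should be held-out data, not the data the centroids were fit on, so that the score reflects structure in unseen data. A negative result is the "bad trade-off" case.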

In the past, I have used other more heuristic measures as well. One of
the key characteristics that I would like to see out of a clustering is
a degree of stability. Thus, I look at the fractions of points that are
assigned to each cluster or the distribution of distances from the
cluster centroid. These values should be relatively stable when applied
to held-out data.

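
A minimal sketch of the stability check on cluster fractions: assign training and held-out points to their nearest centroid and compare the two vectors of fractions. Summarizing the difference with total variation distance is an assumption made here; the thread does not name a specific comparison, and the function names are hypothetical.

```python
from collections import Counter


def nearest_index(point, centroids):
    """Index of the centroid closest to point in squared L_2 distance."""
    dists = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return dists.index(min(dists))


def cluster_fractions(points, centroids):
    """Fraction of points assigned to each centroid, in centroid order."""
    counts = Counter(nearest_index(p, centroids) for p in points)
    return [counts[i] / len(points) for i in range(len(centroids))]


def total_variation(fracs_a, fracs_b):
    """Total variation distance between two fraction vectors (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(fracs_a, fracs_b))
```

A value near 0 between the training and held-out fractions suggests stable cluster sizes; what counts as "too large" depends on the data set size and is left open here.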
For text, you can actually compute perplexity which measures how well
cluster membership predicts what words are used. This is nice because
you don't have to worry about the entropy of real valued numbers.
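
As an illustration of the perplexity idea, the sketch below fits one unigram word model per cluster on training documents and scores held-out documents under the model of their assigned cluster. The unigram model, the add-one smoothing with a single slot for unseen words, and all function names are assumptions for the example, not something specified in the thread.

```python
import math
from collections import Counter, defaultdict


def fit_cluster_unigrams(docs, labels):
    """Count words per cluster. docs are token lists, labels are cluster ids."""
    counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
        vocab.update(tokens)
    return counts, vocab


def perplexity(docs, labels, counts, vocab):
    """Per-token perplexity of docs given their cluster labels.

    Uses add-one smoothing over the training vocabulary plus one extra
    slot shared by all unseen words (an assumption of this sketch).
    """
    v = len(vocab) + 1
    log2_sum = 0.0
    n_tokens = 0
    for tokens, label in zip(docs, labels):
        cluster_counts = counts[label]
        total = sum(cluster_counts.values())
        for tok in tokens:
            prob = (cluster_counts[tok] + 1) / (total + v)
            log2_sum += math.log2(prob)
            n_tokens += 1
    return 2 ** (-log2_sum / n_tokens)
```

A useful baseline is the same computation with every document placed in a single cluster: if the clustered perplexity on held-out text is not clearly lower, cluster membership is telling you little about word usage.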

OK, so how do we go about codifying this stuff? Is there existing code that we could use or is it worth us writing our own?

Some references would be good here, too. Feel free to add to http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData. (I've already linked this conversation, but will probably cut and paste some of it too.)

Manual inspection and the so-called laugh test is also important. The
idea is that the results should not be so ludicrous as to make you
laugh. Unfortunately, it is pretty easy to kid yourself into thinking
your system is working using this kind of inspection. The problem is
that we are too good at seeing (making up) patterns.

I think this is where the new Open Relevance Project can come in, too. Judgments, etc. ain't just for search!

On Tue, Jun 16, 2009 at 2:35 PM, Grant Ingersoll <[email protected]> wrote:
What tools/approaches are people using to validate their clustering
output? Are there utilities that we should be implementing that would
make this easier for users?


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search
