Just to add a little bit based on some research I've been doing on the subject, 
it seems there are several techniques for naming clusters, ranging from the 
mundane to the intricate:

1. Top terms based on weight (e.g., TF-IDF) -- Implemented in Mahout in the 
ClusterDumper - Just sort the terms across the docs in the cluster by weight 
and spit out some top subset
2. Log-likelihood Ratio (LLR) - Implemented in Mahout in ClusterLabels; 
currently requires a Lucene index, but could be modified - Calculates the 
log-likelihood ratio for the terms in the vectors, sorts by that value, and 
spits out the top-scoring terms as labels
3. Some type of LSA/SVD approach - Implemented in Carrot2 and others - Take 
the SVD of the document vectors to identify the base concepts, then label the 
clusters using the terms that dominate those concepts
4. Frequent phrases using something like a suffix tree or other phrase 
detection methods - Implemented in Carrot2 (Suffix Tree Clustering) and 
others - Finds frequent phrases and ranks them by weight to pick the labels

I'm probably missing some other approaches so feel free to fill in, but those 
are what I've come across so far.

-Grant

On Jan 3, 2010, at 3:07 PM, Ted Dunning wrote:

> Good thing to do.
> 
> Slightly tricky to do.  But worthy.
> 
> On Sun, Jan 3, 2010 at 11:04 AM, Grant Ingersoll <[email protected]>wrote:
> 
>> My first thought is to just create an n-gram model of the same field I'm
>> clustering on (as that will allow the existing code to work unmodified), but
>> I wanted to hear what others think.  Is it worth the time?
>> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve

