Just to add a little bit based on some research I've been doing on the subject, it seems there are several techniques for naming clusters, ranging from the mundane to the intricate:
1. Top terms based on weight (e.g. TF-IDF)
   - Implemented in Mahout in the ClusterDumper
   - Just sort the top terms across the docs in the cluster and spit out some subset
2. Log-likelihood Ratio (LLR)
   - Implemented in Mahout in ClusterLabels; currently requires a Lucene index, but could be modified
   - Calculates the log-likelihood for the terms in the vectors, then sorts based on the value and spits out the labels
3. Some type of LSA/SVD approach
   - Implemented in Carrot2, others
   - Identify the concepts by taking the SVD of the vectors and then determine/use the base concepts derived from that
4. Frequent phrases using something like a suffix tree or other phrase detection methods
   - Implemented in Carrot2 (Suffix Tree Clustering) and others
   - Finds frequent phrases and sorts them based on a weight to return

I'm probably missing some other approaches, so feel free to fill in, but those are what I've come across so far.

-Grant

On Jan 3, 2010, at 3:07 PM, Ted Dunning wrote:

> Good thing to do.
>
> Slightly tricky to do. But worthy.
>
> On Sun, Jan 3, 2010 at 11:04 AM, Grant Ingersoll <[email protected]> wrote:
>
>> My first thought is to just create an n-gram model of the same field I'm
>> clustering on (as that will allow the existing code to work unmodified), but
>> I wanted to hear what others think. Is it worth the time?
>>
>
> --
> Ted Dunning, CTO
> DeepDyve
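
To make approach 2 from the list above a bit more concrete, here is a rough sketch of LLR-based labeling in Python. It is not Mahout's ClusterLabels code. The names `llr` and `label_cluster` are made up for illustration, and it assumes each document is just a list of tokens and that document frequencies (not term weights) are what gets counted. The score is the usual G^2 log-likelihood ratio over a 2x2 table of "term present / absent" versus "in cluster / outside cluster".

```python
from collections import Counter
from math import log


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: docs in the cluster containing the term
    k12: docs in the cluster without the term
    k21: docs outside the cluster containing the term
    k22: docs outside the cluster without the term
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to guard against small negative rounding errors.
    return max(0.0, 2.0 * (row + col - mat))


def label_cluster(cluster_docs, other_docs, top_n=3):
    """Return the top_n terms that best distinguish cluster_docs (hypothetical helper)."""
    # Document frequency of each term inside and outside the cluster.
    in_df = Counter(t for doc in cluster_docs for t in set(doc))
    out_df = Counter(t for doc in other_docs for t in set(doc))
    n_in, n_out = len(cluster_docs), len(other_docs)

    scored = []
    for term, k11 in in_df.items():
        k21 = out_df.get(term, 0)
        # Skip terms that are not over-represented in the cluster;
        # LLR alone does not say which side a term favors.
        if k11 * n_out <= k21 * n_in:
            continue
        scored.append((llr(k11, n_in - k11, k21, n_out - k21), term))

    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_n]]
```

Calling `label_cluster(cluster_docs, other_docs, top_n=2)` with two lists of tokenized documents returns the two highest-scoring terms for the cluster. Swapping the LLR score for a summed TF-IDF weight would give something closer to approach 1.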
