> 1. Top terms based on weight (e.g. TF-IDF) -- Implemented in Mahout in the ClusterDumper -- just sort the top terms across the docs in the cluster and spit out some subset.
> 2. Log-Likelihood Ratio (LLR) -- Implemented in Mahout in ClusterLabels; currently requires a Lucene index, but could be modified -- calculates the log-likelihood for the terms in the vectors, then sorts on that value and spits out the labels.
> 3. Some type of LSA/SVD approach -- Implemented in Carrot2, among others -- identify the concepts by taking the SVD of the vectors, then determine/use the base concepts derived from that.
> 4. Frequent phrases, using something like a suffix tree or other phrase detection methods -- Implemented in Carrot2 (Suffix Tree Clustering) and others -- finds frequent phrases and sorts them by weight to return.
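To make approach 2 concrete, here is a minimal sketch of the LLR score following Dunning's G^2 formulation over a 2x2 contingency table (the same statistic Mahout's LogLikelihood class computes); the example counts are made up:

```python
import math

def xlogx(x):
    """x * ln(x), defined as 0 at x = 0."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized entropy term used in Dunning's G^2 statistic."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table:
    k11 = occurrences of the term inside the cluster
    k12 = occurrences of the term outside the cluster
    k21 = occurrences of other terms inside the cluster
    k22 = occurrences of other terms outside the cluster
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# A term spread evenly across clusters scores ~0; a term concentrated
# in one cluster scores high, so sorting terms by LLR surfaces labels.
print(llr(5, 5, 5, 5))    # ~0.0
print(llr(10, 0, 0, 10))  # large positive (~27.7)
```

Sorting a cluster's vocabulary by this score and emitting the top few terms is essentially what the ClusterLabels step does.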
Just a short follow-up note to this discussion. The major difference is perhaps not in how you select labels, but in how you approach clustering AND labelling together: whether you want to find labels for existing groups of items, or (which is what Carrot2 does) whether you want to find good potential cluster labels first and then reject (or select) those labels that actually correspond to groups of documents found in the input. The actual algorithmic method follows either of these paradigms.

I realize the difference may seem small at first, but it has huge consequences for what you consider a cluster, for clustering quality, etc. Consider these facts:

- Clusters for which there are no good labels can (and probably should) be discarded. We simply cannot convey their meaning to the user, so why bother.

- In the second approach, the structure of clusters is strongly affected by the labels and by the mechanism of assigning documents to these labels. If you keep it strict (e.g. "every item in the cluster must contain the cluster's exact phrase"), then the relationship between a cluster's label and its content directly conveys something to the user. If you allow some fuzziness (e.g. "every item in the cluster must contain at least one word from the set of keywords forming the cluster's label"), then the label is only a hint and the user must still come up with the "glue" (whatever it is that caused these particular keywords to be grouped together).

What's worse, neither method is "better". We at Carrot2 have a strong feeling that clusters should be described properly in order to be useful, but one may argue that in many, many applications of clustering the labels are _not_ important, and individual features of clusters (like keywords, or even the documents themselves) are enough.
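The strict versus fuzzy assignment policies can be illustrated with a toy sketch; the documents, labels, and whitespace tokenization below are made up for illustration, not taken from any of the systems mentioned:

```python
def assign_strict(phrase, docs):
    """Strict policy: a doc joins the cluster only if it contains the exact phrase."""
    return [d for d in docs if phrase in d.lower()]

def assign_fuzzy(keywords, docs):
    """Fuzzy policy: a doc joins if it shares at least one keyword with the label."""
    kw = set(keywords)
    return [d for d in docs if kw & set(d.lower().split())]

docs = [
    "Suffix trees support frequent phrase extraction",
    "Phrase extraction with suffix arrays",
    "Clustering search results by topic",
]

print(assign_strict("phrase extraction", docs))          # first two docs
print(assign_fuzzy({"phrase", "clustering"}, docs))      # all three docs
```

The same label candidate yields different cluster contents under the two policies, which is exactly why the choice shapes what the user can read out of a label.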
Back to the labelling methods Grant described: STC and Lingo (from Carrot2) are representatives of the "labels come first" approach, and their usefulness will be limited for labelling existing document clusters. STC, for example, is basically a complex feature extraction algorithm (extraction of frequent phrases using suffix trees), followed by a simple single-link greedy heuristic merging the document sets implied by these features. Whether this approach can be scaled up to labelling existing clusters -- I don't know.

What is certainly worth investigating is extraction of better features. n-grams are a good candidate (especially with pruning techniques). Suffix arrays could be used to speed up the process to some extent (batching a good sample of documents in a single mapper, running frequent phrase detection, and emitting only the frequent phrases that exceed a given threshold in the batch). If you'd like to experiment with this approach, the suffix tree code is in Carrot2 and suffix arrays are here: http://www.jsuffixarrays.org

I don't know if it helps in any way. Apologies if I've been inconsistent in any of the above -- I've had about a dozen urgent (and I mean: urgent) interrupts from my two-year-old.

Dawid
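The per-batch n-gram counting with threshold pruning mentioned above can be sketched roughly as follows; this is a toy illustration (naive whitespace tokenization, made-up batch and threshold), not the suffix-array version:

```python
from collections import Counter

def frequent_ngrams(docs, n=2, min_count=2):
    """Count word n-grams in one batch of documents and keep only those
    at or above min_count -- the pruning step a mapper would apply before
    emitting candidate phrases."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

batch = [
    "open source search engines",
    "open source clustering of search results",
    "search results clustering",
]
print(frequent_ngrams(batch, n=2, min_count=2))
# {'open source': 2, 'search results': 2}
```

A suffix tree or suffix array replaces the inner loop and handles phrases of unbounded length, but the emit-above-threshold logic stays the same.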
