> 1. Top terms based on weight (e.g. TFIDF) -- Implemented in Mahout in the 
> ClusterDumper - Just sort the top terms across the docs in the cluster and 
> spit out some subset
> 2. Log-likelihood Ratio (LLR) - Implemented in Mahout in ClusterLabels, 
> currently requires Lucene index, but could be modified - Calculates the 
> log-likelihood for the terms in the vectors and then sorts based on the value 
> and spits out the labels
> 3. Some type of LSA/SVD approach - Implemented in Carrot2, others - Identify 
> the concepts by taking the SVD of the vectors and then determine/use the base 
> concepts derived from that
> 4. Frequent Phrases using something like a suffix tree or other phrase 
> detection methods - Implemented in Carrot2 (Suffix Tree Clustering) and 
> others - finds frequent phrases and sorts them based on a weight to return

Just a short follow-up note to this discussion.

The major difference is perhaps not in how you select labels, but in
how you approach clustering AND labelling together -- whether you want
to find labels for existing groups of items, or (which is what Carrot2
does) whether you want to find good potential cluster labels first and
then reject (or select) those labels that actually correspond to
groups of documents found in the input. Any concrete algorithm follows
one of these two paradigms.

I realize the difference may seem small at first, but it has huge
consequences for what you consider a cluster, how you judge clustering
quality, etc.
Consider these facts:

- Clusters for which there are no good labels can (and probably
should) be discarded. We simply cannot convey their meaning to the
user, so why bother.

- In the second approach the structure of clusters is strongly
affected by the labels and the mechanism of assigning documents to
these labels. If you keep it strict (e.g. "every item in the cluster
must contain the cluster's exact phrase"), then the relationship
between a cluster's label and its content will directly convey
something to the user. If you allow some fuzziness (e.g., "every item
in the cluster must contain at least one word from the set of keywords
forming the cluster's label"), then the label is only a hint and the
user must still come up with the "glue" (whatever it is that caused
these particular keywords to be grouped together).
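
To make the strict/fuzzy distinction concrete, here are the two
membership rules as predicates (function names are mine -- this is
just the rule from the text, not Carrot2 code):

```python
def strict_member(doc_tokens, label_phrase):
    """Strict rule: the document must contain the label's exact phrase."""
    n = len(label_phrase)
    return any(doc_tokens[i:i + n] == label_phrase
               for i in range(len(doc_tokens) - n + 1))

def fuzzy_member(doc_tokens, label_keywords):
    """Fuzzy rule: at least one keyword from the label must appear."""
    return bool(set(doc_tokens) & set(label_keywords))
```

Under the strict rule the label is self-explanatory; under the fuzzy
one the user still has to infer the "glue" themselves.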

What's worse -- neither method is "better". We at Carrot2 have a
strong feeling that clusters should be described properly in order to
be useful, but one may argue that in many, many applications of
clustering the labels are _not_ important and individual features of
clusters (like keywords, or even the documents themselves) are
enough.

Back to the labelling methods Grant described -- STC and Lingo (from
Carrot2) are representatives of the "labels come first" approach and
their usefulness will be limited in labelling existing document
clusters. STC, for example, is basically a complex feature extraction
algorithm (extraction of frequent phrases using suffix trees),
followed by a simple single-link greedy heuristic merging the document
sets implied by these features. Whether this approach can be scaled up
to labelling existing clusters -- I don't know.
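
The two STC phases can be caricatured in a few lines (plain n-gram
counting stands in for the suffix tree, and the merge rule is the
usual >50% mutual-overlap test -- a sketch, not the Carrot2
implementation):

```python
from collections import defaultdict

def stc_like_clusters(docs, min_support=2, max_len=3):
    """Sketch of the STC pipeline: frequent phrase extraction followed
    by a single-link greedy merge of the document sets those phrases
    imply. N-gram counting replaces the suffix tree for brevity."""
    # Phase 1: frequent phrase -> set of supporting doc ids.
    phrase_docs = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                phrase_docs[tuple(tokens[i:i + n])].add(doc_id)
    base = [(p, ds) for p, ds in phrase_docs.items() if len(ds) >= min_support]

    # Phase 2: single-link merge -- join a base cluster into an existing
    # one when their doc sets overlap by more than half on both sides.
    # (Real STC builds a merge graph and takes connected components.)
    clusters = []
    for phrase, doc_set in base:
        for cluster in clusters:
            inter = len(doc_set & cluster["docs"])
            if inter / len(doc_set) > 0.5 and inter / len(cluster["docs"]) > 0.5:
                cluster["docs"] |= doc_set
                cluster["phrases"].append(phrase)
                break
        else:
            clusters.append({"docs": set(doc_set), "phrases": [phrase]})
    return clusters
```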

What is certainly worth investigating is the extraction of better
features. n-grams are a good candidate (especially with pruning
techniques). Suffix arrays could be used to speed up the process to
some extent (batching a good sample of documents in a single mapper,
running frequent phrase detection, emitting only frequent phrases that
exceed a given threshold in the batch). If you'd like to experiment
with this approach, suffix tree code is in Carrot2 and suffix arrays
are here: http://www.jsuffixarrays.org
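
The batching idea, as a map-side sketch (a hypothetical function of
mine; a real job would plug in suffix arrays for the phrase detection
step instead of the naive n-gram count below):

```python
from collections import Counter

def map_batch_frequent_ngrams(batch, n=2, min_count=3):
    """One mapper's work on a batch of tokenized documents: count
    n-grams within the batch and emit only those exceeding a given
    threshold, pruning the long tail before the shuffle."""
    counts = Counter()
    for tokens in batch:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    # Emit (phrase, count) pairs that pass the in-batch threshold.
    return [(phrase, c) for phrase, c in counts.items() if c >= min_count]
```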

I don't know if it helps in any way. Apologies if I've been
inconsistent in any of the above -- had about a dozen urgent (and I
mean: urgent) interrupts from my two-year-old.

Dawid
