On Sun, Jan 3, 2010 at 1:47 PM, Grant Ingersoll <[email protected]> wrote:

> Just to add a little bit based on some research I've been doing on the
> subject, it seems there are several techniques for naming clusters, ranging
> from the mundane to the intricate:
>
> 1. Top terms based on weight (e.g. TF-IDF) -- Implemented in Mahout in the
> ClusterDumper - Just sort the terms across the docs in the cluster by
> weight and spit out the top subset
> 2. Log-likelihood Ratio (LLR) - Implemented in Mahout in ClusterLabels,
> currently requires a Lucene index, but could be modified - Calculates the
> log-likelihood for the terms in the vectors and then sorts based on the
> value and spits out the labels
>

These two are actually closely related when you look at the math involved.
LLR just has a few extra terms in the formula that help it avoid picking bad
labels.
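
For concreteness, here is a minimal standalone sketch of that 2x2 LLR
computation in Java.  The entropy formulation is the standard one; how the
contingency cells get filled in, and the numbers in main(), are illustrative,
not Mahout's actual code:

```java
// Minimal sketch: 2x2 log-likelihood ratio for a candidate label term.
// k11 = docs in the cluster containing the term
// k12 = docs in the cluster without the term
// k21 = docs outside the cluster containing the term
// k22 = docs outside the cluster without the term
public final class Llr {
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy over raw counts.
  private static double entropy(long... counts) {
    long sum = 0;
    double parts = 0.0;
    for (long c : counts) {
      parts += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - parts;
  }

  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    // Guard against tiny negative values from floating-point rounding.
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
  }

  public static void main(String[] args) {
    // e.g. a term in 40 of 50 cluster docs but only 100 of 10,000 others
    System.out.println(logLikelihoodRatio(40, 10, 100, 9900));
  }
}
```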

The LLR method can be extended to compare a cluster bigram language model
against a corpus bigram language model.  This requires counting bigrams over
the entire corpus as well as over each cluster.  If that is feasible, it can
give good results.
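
A rough sketch of what that bigram counting and scoring could look like,
reusing the logLikelihoodRatio() above.  The whitespace tokenization is
naive, and the assumption that the corpus doc list includes the cluster docs
is mine:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the bigram-LLR extension: compare each bigram's frequency in the
// cluster against its frequency in the rest of the corpus.
public final class BigramLabels {
  static Map<String, Long> countBigrams(List<String> docs) {
    Map<String, Long> counts = new HashMap<>();
    for (String doc : docs) {
      String[] tokens = doc.toLowerCase().split("\\s+");
      for (int i = 0; i + 1 < tokens.length; i++) {
        counts.merge(tokens[i] + ' ' + tokens[i + 1], 1L, Long::sum);
      }
    }
    return counts;
  }

  // Assumes corpusDocs includes the cluster docs.
  static void scoreCluster(List<String> clusterDocs, List<String> corpusDocs) {
    Map<String, Long> cluster = countBigrams(clusterDocs);
    Map<String, Long> corpus = countBigrams(corpusDocs);
    long clusterTotal = cluster.values().stream().mapToLong(Long::longValue).sum();
    long corpusTotal = corpus.values().stream().mapToLong(Long::longValue).sum();
    cluster.forEach((bigram, inCluster) -> {
      long elsewhere = corpus.getOrDefault(bigram, inCluster) - inCluster;
      double llr = Llr.logLikelihoodRatio(
          inCluster, clusterTotal - inCluster,
          elsewhere, corpusTotal - clusterTotal - elsewhere);
      System.out.printf("%.1f\t%s%n", llr, bigram);
    });
  }

  public static void main(String[] args) {
    List<String> cluster = List.of("machine learning with mahout",
                                   "clustering with mahout");
    List<String> corpus = new ArrayList<>(cluster);
    corpus.add("the quick brown fox");
    corpus.add("the lazy dog sleeps");
    scoreCluster(cluster, corpus);
  }
}
```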


> 3. Some type of LSA/SVD approach - Implemented in Carrot2, others -
> Identify the concepts by taking the SVD of the vectors and then
> determine/use the base concepts derived from that
>

I don't know what Carrot2 is doing, but this often involves scanning phrases
from the documents to find the ones most saliently related to the cluster
centroid.  It can have very nice results, especially if you add some sense
of how unusual the phrase is in the corpus.
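
Something like the following sketch captures the centroid-overlap idea.  It
assumes the centroid is available as a term-to-weight map; none of this is
Carrot2's actual implementation:

```java
import java.util.Map;

// Sketch: score a candidate phrase against a cluster centroid. A phrase is
// treated as a tiny bag of terms; its salience is its weight overlap with
// the centroid, boosted by corpus rarity (an IDF-like factor).
public final class PhraseSalience {
  static double score(String[] phraseTerms,
                      Map<String, Double> centroid,    // term -> weight
                      Map<String, Integer> corpusDf,   // phrase -> doc freq
                      int numDocs) {
    double overlap = 0.0;
    for (String term : phraseTerms) {
      overlap += centroid.getOrDefault(term, 0.0);
    }
    overlap /= Math.sqrt(phraseTerms.length);  // mild length normalization
    String phrase = String.join(" ", phraseTerms);
    int df = corpusDf.getOrDefault(phrase, 1);
    double rarity = Math.log((double) numDocs / df);  // rarer => bigger boost
    return overlap * rarity;
  }

  public static void main(String[] args) {
    Map<String, Double> centroid = Map.of("hadoop", 0.9, "cluster", 0.7, "the", 0.05);
    Map<String, Integer> df = Map.of("hadoop cluster", 12);
    System.out.println(score(new String[] {"hadoop", "cluster"}, centroid, df, 10000));
  }
}
```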


> 4. Frequent Phrases using something like a suffix tree or other phrase
> detection methods - Implemented in Carrot2 (Suffix Tree Clustering) and
> others - finds frequent phrases and sorts them based on a weight to return
>

This is also related to 1 and 2, except that it uses phrases and basically
blows off the IDF part of the weighting (which is plausible, since almost all
phrases are pretty rare).  It is subject to problems where fixed phrases
proliferate through the corpus.  My own nasty experience was with the phrase
"Staff writer of the Wall Street Journal", which seemed highly significant
for several articles (for a moment).
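
One cheap guard against that failure mode is to put a little document
frequency back in at the phrase level and drop anything that shows up in too
many corpus documents.  A sketch, where maxDfFraction is a made-up tuning
knob:

```java
import java.util.Map;

// Sketch: suppress boilerplate phrases ("Staff writer of ...") by capping
// corpus-wide document frequency. The 0.05 cutoff below is a guess, not an
// empirically validated value.
public final class BoilerplateFilter {
  static boolean keep(String phrase, Map<String, Integer> corpusDf,
                      int numDocs, double maxDfFraction) {
    int df = corpusDf.getOrDefault(phrase, 0);
    return df < maxDfFraction * numDocs;
  }

  public static void main(String[] args) {
    Map<String, Integer> df = Map.of("staff writer of the wall street journal", 800);
    System.out.println(keep("staff writer of the wall street journal", df, 10000, 0.05));
  }
}
```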

In my recent experience, suffix trees are often overkill, since you can
count all the phrases in a small document set (a cluster) very, very
quickly.  You usually don't even need to look at all of the members of a
cluster, just the few dozen closest to the centroid.
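
For example, a plain hash-map pass like this handles a few dozen docs
essentially instantly; the whitespace tokenization and the phrase-length cap
are assumptions:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: instead of a suffix tree, count every phrase of 2..maxLen words in
// the few dozen docs nearest the centroid with a plain hash map.
public final class BruteForcePhrases {
  static Map<String, Integer> countPhrases(List<String> nearestDocs, int maxLen) {
    Map<String, Integer> counts = new HashMap<>();
    for (String doc : nearestDocs) {
      String[] t = doc.toLowerCase().split("\\s+");
      for (int i = 0; i < t.length; i++) {
        StringBuilder phrase = new StringBuilder(t[i]);
        for (int len = 2; len <= maxLen && i + len <= t.length; len++) {
          phrase.append(' ').append(t[i + len - 1]);
          counts.merge(phrase.toString(), 1, Integer::sum);
        }
      }
    }
    return counts;
  }

  static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int k) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
    return entries.subList(0, Math.min(k, entries.size()));
  }

  public static void main(String[] args) {
    List<String> docs = List.of("big data clustering with mahout",
                                "clustering with mahout is fun");
    top(countPhrases(docs, 3), 5).forEach(e ->
        System.out.println(e.getValue() + "\t" + e.getKey()));
  }
}
```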


> I'm probably missing some other approaches so feel free to fill in, but
> those are what I've come across so far.
>
> -Grant
>
> On Jan 3, 2010, at 3:07 PM, Ted Dunning wrote:
>
> > Good thing to do.
> >
> > Slightly tricky to do.  But worthy.
> >
> > On Sun, Jan 3, 2010 at 11:04 AM, Grant Ingersoll <[email protected]
> >wrote:
> >
> >> My first thought is to just create an n-gram model of the same field I'm
> >> clustering on (as that will allow the existing code to work unmodified),
> >> but I wanted to hear what others think.  Is it worth the time?
> >>
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
>
>
>


-- 
Ted Dunning, CTO
DeepDyve
