> A nice way to do that is the log-likelihood ratio test that I use for
> everything under the sun.  This would consider in-cluster and out-of-cluster
> as two classes and would consider the frequency of each possible term or
> phrase in these two classes.  This will give you words and phrases that are
> anomalously common in your cluster and relatively rare outside it.
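
For concreteness, a minimal sketch of that 2x2 log-likelihood ratio
(G^2) test in Python (illustrative code, not from the original post):

    import math

    def llr_2x2(k11, k12, k21, k22):
        # k11: occurrences of the term inside the cluster
        # k12: occurrences of all other terms inside the cluster
        # k21: occurrences of the term outside the cluster
        # k22: occurrences of all other terms outside the cluster
        def h(*ks):
            # sum of k * log(k / total), skipping empty cells
            total = sum(ks)
            return sum(k * math.log(k / total) for k in ks if k > 0)
        # G^2 = 2 * (H(cells) - H(row sums) - H(column sums))
        return 2.0 * (h(k11, k12, k21, k22)
                      - h(k11 + k12, k21 + k22)
                      - h(k11 + k21, k12 + k22))

Ranking terms by this score surfaces the ones that are anomalously
common in the cluster; the statistic is asymptotically chi-squared
with one degree of freedom.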

The log-likelihood approach seems to rely on the data source being
homogeneous in some respect.  For instance, your Wall Street Journal
source was fairly consistent with respect to authorship.  What about
sources that lack that homogeneity?  For instance, I once tried to
apply a similar technique to the automatic detection of nicknames.  I
had a large corpus of
names and their connections so I could tell that Bill and William
belonged to the same individual.  The problem was that the collection
was so large that ANY repeated connection looked statistically
significant (I was using chi-squares).  I eventually had to apply a
cutoff, but I wonder if there was a more elegant way to do it.  I
realize this is not the same thing as the OP's question - hope you
don't mind :)
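
For illustration, a sketch of that chi-square-plus-cutoff pattern
(again illustrative Python; the counts, cutoff, and threshold are
assumptions, not the original code).  The inflation happens because
the chi-square statistic grows with sample size, so at corpus scale
even trivial associations clear the usual critical values:

    def chi2_2x2(k11, k12, k21, k22):
        # Pearson chi-square for a 2x2 co-occurrence table:
        # k11: records linking the two names (e.g. Bill / William)
        # k12: records with the first name only
        # k21: records with the second name only
        # k22: records with neither name
        n = k11 + k12 + k21 + k22
        den = (k11 + k12) * (k21 + k22) * (k11 + k21) * (k12 + k22)
        return n * (k11 * k22 - k12 * k21) ** 2 / den if den else 0.0

    def linked_names(tables, min_links=5, critical=3.84):
        # The crude fix: skip pairs co-occurring fewer than min_links
        # times, then keep those above the 5% critical value (1 d.f.).
        return [pair for pair, (k11, k12, k21, k22) in tables.items()
                if k11 >= min_links
                and chi2_2x2(k11, k12, k21, k22) > critical]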

Tanton
