> A nice way to do that is the log-likelihood ratio test that I use for
> everything under the sun. This would consider in-cluster and out-of-cluster
> as two classes and would consider the frequency of each possible term or
> phrase in these two classes. This will give you words and phrases that are
> anomalously common in your cluster and relatively rare outside it.
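
For context, a minimal sketch of how that 2x2 log-likelihood ratio (G-squared) test is commonly computed, via the entropy decomposition Dunning described. The counts in the usage comment are invented for illustration:

```python
import math

def x_log_x(x):
    # Treat 0 * log(0) as 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy term: N*ln(N) - sum(k*ln(k))
    total = sum(counts)
    return x_log_x(total) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: occurrences of the term inside the cluster
    k12: occurrences of all other terms inside the cluster
    k21: occurrences of the term outside the cluster
    k22: occurrences of all other terms outside the cluster
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - matrix_entropy)

# Hypothetical example: a term appearing 120 times in a 10,000-word
# cluster versus 500 times in the 1,000,000-word rest of the corpus.
score = llr(120, 10_000 - 120, 500, 1_000_000 - 500)
```

Ranking terms by this score surfaces the ones anomalously common in the cluster; the score itself is the G-squared statistic, so high values indicate strong association.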
This seems to rely on the data source being homogeneous in some respect. For instance, your Wall Street Journal source was fairly consistent with respect to the author. What about sources that lack that homogeneity?

I once tried to apply something similar to the automatic determination of nicknames. I had a large corpus of names and their connections, so I could tell that Bill and William belonged to the same individual. The problem was that the collection was so large that ANY repeated connection looked statistically significant (I was using chi-squares). I eventually had to apply a cutoff, but I wonder if there was a more elegant way to do it. I realize this is not the same thing as the OP's question - hope you don't mind :)

Tanton
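
To make the scale problem Tanton describes concrete: for a 2x2 table with fixed cell proportions, Pearson's chi-square statistic grows linearly with the total count, so in a large enough corpus even a tiny, fixed deviation from independence clears any fixed significance threshold. A minimal sketch, with invented counts:

```python
def chi_square_2x2(k11, k12, k21, k22):
    # Pearson chi-square: n * (k11*k22 - k12*k21)^2 / (R1*R2*C1*C2)
    n = k11 + k12 + k21 + k22
    r1, r2 = k11 + k12, k21 + k22
    c1, c2 = k11 + k21, k12 + k22
    return n * (k11 * k22 - k12 * k21) ** 2 / (r1 * r2 * c1 * c2)

# Identical proportions, total count scaled 1x, 100x, 10,000x:
# the statistic scales by the same factor, so a weak association
# that is nowhere near significant at small n becomes "significant"
# purely because the corpus got bigger.
for scale in (1, 100, 10_000):
    print(scale, chi_square_2x2(3 * scale, 97 * scale,
                                200 * scale, 9700 * scale))
```

This is one reason scores like the raw G-squared above are often used for ranking rather than compared against a significance threshold when corpora are very large.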
