On Tue, Feb 16, 2010 at 1:38 PM, Ted Dunning <[email protected]> wrote:
> I think that as far as pure corpus analysis is concerned, LLR, min/max DF
> and tf-idf are about as good as you will get. TF-idf is, in fact, an
> approximation of LLR, so I don't even think you need to use that (and it is
> document centered rather than corpus centric in any case). You might get
> some mileage out of looking for terms that have highly variable LLR in
> different documents.

Am I incorrect in thinking that the events used for LLR here are the
occurrences of the individual terms in a bigram? I'm looking here:

http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java?view=markup

I don't follow the argument that tf-idf is an approximation of LLR. Are you
referring to the Papineni paper?

FWIW, I've found Residual IDF to be more effective than IDF at selecting
words. Another useful approach is to look for bigrams which have a "peaked"
distribution; that is, considering their document frequency, they have
unusually large within-document counts.

Jason

--
Jason Rennie
Research Scientist, ITA Software
617-714-2645
http://www.itasoftware.com/
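
P.S. In case code is clearer than prose, here is a rough, untested sketch of
the 2x2 contingency table I'm assuming the LLR is computed over for a bigram
"A B". The class name and the counts in main() are made up for illustration,
and the entropy formulation is just my reading of LogLikelihood.java, not a
copy of it:

// Sketch only: the four "events" as I understand them are occurrences of the
// two component terms across all bigrams in the corpus:
//
//                next == B     next != B
//   cur == A        k11           k12
//   cur != A        k21           k22
//
// k11 = count of the bigram "A B" itself
// k12 = count of "A *" bigrams other than "A B"
// k21 = count of "* B" bigrams other than "A B"
// k22 = count of all remaining bigrams
public final class BigramLlr {

  // Dunning's G^2 written with the "unnormalized entropy" trick, which I
  // believe is what the Mahout class does.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    // clamp tiny negative values caused by floating-point rounding
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
  }

  private static double entropy(long... counts) {
    long sum = 0;
    double parts = 0.0;
    for (long c : counts) {
      parts += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - parts;
  }

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  public static void main(String[] args) {
    // Invented counts: "A B" seen 120 times, "A *" (other) 1880,
    // "* B" (other) 30, everything else 97970.
    System.out.println(logLikelihoodRatio(120, 1880, 30, 97970));
  }
}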
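
And a similarly rough sketch of what I mean by Residual IDF and a "peaked"
distribution: compare a term's observed IDF against the IDF a Poisson model
would predict from its total corpus count. Again, the class name and the
document counts below are invented purely to show the contrast:

// Sketch of Residual IDF (Church & Gale style):
//
//   observed IDF = -log2(df / N)
//   expected IDF = -log2(1 - e^(-cf/N))   // Poisson P(doc has >= 1 occurrence)
//   residual IDF = observed IDF - expected IDF
//
// N = number of documents, df = documents containing the term,
// cf = total occurrences of the term in the corpus.
public final class ResidualIdf {

  public static double residualIdf(long df, long cf, long numDocs) {
    double observedIdf = -log2((double) df / numDocs);
    double lambda = (double) cf / numDocs;               // mean occurrences per doc
    double expectedIdf = -log2(1.0 - Math.exp(-lambda)); // Poisson prediction
    return observedIdf - expectedIdf;                    // large => peaked/bursty
  }

  private static double log2(double x) {
    return Math.log(x) / Math.log(2.0);
  }

  public static void main(String[] args) {
    // Two hypothetical bigrams with the same total count but different spread:
    // a "peaked" one (200 occurrences in 20 docs) vs a "flat" one (200 in 180).
    System.out.println("peaked: " + residualIdf(20, 200, 10000));
    System.out.println("flat:   " + residualIdf(180, 200, 10000));
  }
}

A large residual means the occurrences are clumped into fewer documents than
the Poisson model would predict, which is exactly the peakedness I was
describing.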
