On Tue, Feb 16, 2010 at 1:38 PM, Ted Dunning <[email protected]> wrote:

> I think that as far as pure corpus analysis is concerned, LLR, min/max DF
> and tf-idf are about as good as you will get.  TF-idf is, in fact, an
> approximation of LLR, so I don't even think you need to use that (and it is
> document centered rather than corpus centric in any case).  You might get
> some mileage out of looking for terms that have highly variable LLR in
> different documents.
>

Am I incorrect in thinking that the events used for LLR here are the
occurrences of the individual terms in a bigram?  I'm looking here:

http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java?view=markup
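
For concreteness, here is a minimal Python sketch of the usual collocation setup, where the two events are "first term is A" and "second term is B" over all bigram positions. This is an assumption about which events are intended, not something taken from the Mahout source linked above. The function names and count arguments are hypothetical.

```python
import math


def bigram_contingency(count_ab, count_a, count_b, total_bigrams):
    """Build the 2x2 table for bigram "A B" (hypothetical helper).

    count_ab      -- times A is immediately followed by B
    count_a       -- times A appears as the first term of any bigram
    count_b       -- times B appears as the second term of any bigram
    total_bigrams -- total number of bigram positions in the corpus
    """
    k11 = count_ab                                       # A then B
    k12 = count_a - count_ab                             # A then not-B
    k21 = count_b - count_ab                             # not-A then B
    k22 = total_bigrams - count_a - count_b + count_ab   # neither
    return k11, k12, k21, k22


def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) statistic for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            k = cells[i][j]
            if k > 0:  # a zero cell contributes nothing (0 * log 0 := 0)
                total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total
```

With this setup, a table whose rows and columns are independent scores 0, and the score grows as A and B co-occur more often than their marginal counts predict.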

I don't follow the argument that tf-idf is an approximation of LLR.  Are you
referring to the Papineni paper?

FWIW, I've found Residual IDF to be more effective than IDF at selecting
words.  Another useful approach is to look for bigrams which have a "peaked"
distribution; that is, bigrams whose within-document counts are unusually
large given their document frequency.
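
Both ideas can be sketched in a few lines of Python. Residual IDF is taken here in the standard sense: observed IDF minus the IDF that a Poisson model would predict from the term's total corpus count. The "peakedness" measure is just the average count per containing document, which is one simple proxy and an assumption on my part; the function and argument names are hypothetical.

```python
import math


def residual_idf(doc_freq, corpus_freq, num_docs):
    """Observed IDF minus the IDF expected under a Poisson model.

    doc_freq    -- number of documents containing the term
    corpus_freq -- total occurrences of the term in the corpus
    num_docs    -- number of documents in the corpus
    """
    observed_idf = -math.log2(doc_freq / num_docs)
    # Under Poisson with rate corpus_freq / num_docs, the probability that
    # a document contains the term at least once is 1 - exp(-rate).
    expected_idf = -math.log2(1.0 - math.exp(-corpus_freq / num_docs))
    return observed_idf - expected_idf


def mean_count_per_doc(doc_freq, corpus_freq):
    """Average within-document count over documents containing the term."""
    return corpus_freq / doc_freq
```

A term whose 100 occurrences are packed into 10 documents gets a much larger residual IDF than one whose 100 occurrences are spread over 95 documents, even though both have the same corpus count.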

Jason

-- 
Jason Rennie
Research Scientist, ITA Software
617-714-2645
http://www.itasoftware.com/
