On Fri, Jan 8, 2010 at 4:44 AM, Shashikant Kore <[email protected]> wrote:
> On Fri, Jan 8, 2010 at 10:36 AM, Robin Anil <[email protected]> wrote:
>>
>> One interesting thing I found was that any ngram with LLR < 1 is
>> practically junk, and anything over LLR > 50 is pretty awesome. Between
>> 1 and 50, it's always debatable. This holds approximately true for large
>> and small datasets.
>
> I don't think the absolute value of the LLR score is an indicator of the
> importance of a term across all datasets.
>
> With a corpus of a million documents, if I calculate LLR scores of terms
> in a set of, say, 50,000 documents, I get hundreds of terms with scores
> above 50, many of which are not "useful."

In my case, computing LLR on bigrams over the corpus of all 50M+ LinkedIn
profiles, if you order by LLR descending, the scores start out *huge*
(10^5 or so, for specialized bigrams like "myocardial infarction", which
is about as non-independent as it gets) and decline gradually from there.

Since the LLR formula for bigrams is a sum of count * log(probability)
terms, the overall size of the corpus acts partly as a multiplicative
factor in the score. So there is no good scale-independent threshold, only
a relative ordering - which is why I've always just said "give me the top
0.1% to 1% of ngrams (ordered by LLR)" out of my set.

  -jake
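The corpus-size effect described above can be checked directly with
Dunning's G^2 log-likelihood ratio on a 2x2 bigram contingency table. A
minimal sketch (the function name and the sample counts are illustrative,
not from the thread): scaling every cell of the table by 10, as if the
same collocation were observed in a corpus 10x larger, multiplies the
score by exactly 10.

```python
import math

def llr_bigram(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a bigram "A B".

    k11: count of bigram (A followed by B)
    k12: count of A followed by anything but B
    k21: count of B preceded by anything but A
    k22: count of all other bigrams in the corpus
    """
    n = k11 + k12 + k21 + k22
    row = (k11 + k12, k21 + k22)   # marginal counts for A / not-A
    col = (k11 + k21, k12 + k22)   # marginal counts for B / not-B
    g2 = 0.0
    for k, r, c in ((k11, row[0], col[0]), (k12, row[0], col[1]),
                    (k21, row[1], col[0]), (k22, row[1], col[1])):
        if k > 0:
            # 2 * observed * log(observed / expected-under-independence)
            g2 += 2.0 * k * math.log(k * n / (r * c))
    return g2

base = llr_bigram(20, 80, 100, 99800)          # hypothetical small corpus
scaled = llr_bigram(200, 800, 1000, 998000)    # same table, 10x the counts
# scaled == 10 * base: the score carries a corpus-size factor,
# so only the relative ordering of ngrams is comparable across corpora.
```

This is why a fixed cutoff like "LLR > 50" behaves differently on a
million-document corpus than on 50,000 documents, while "take the top
0.1% to 1% by LLR" transfers.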
