On Fri, Jan 8, 2010 at 4:44 AM, Shashikant Kore <[email protected]> wrote:

> On Fri, Jan 8, 2010 at 10:36 AM, Robin Anil <[email protected]> wrote:
> >
> > One interesting thing I found was that any ngram with LLR < 1 is
> > practically junk, and anything with LLR > 50 is pretty awesome. Between
> > 1 and 50, it's always debatable. This holds approximately true for both
> > large and small datasets.
> >
>
> I don't think the absolute value of the LLR score is an indicator of
> the importance of a term across all datasets.
>
> With a corpus of a million documents, if I calculate the LLR score of
> terms in a set of, say, 50,000 documents, I get hundreds of terms with
> scores above 50, many of which are not "useful."
>

In my case, when doing LLR on bigrams on the corpus of all 50M+ LinkedIn
profiles, if you order by LLR descending, they start out *huge* (10^5 or so,
for specialized bigrams like "myocardial infarction" which is about as
non-independent as it gets), and go down gradually from there.

Since the math for bigram LLR takes the form of a sum of
count * log(probability) terms, the overall size of the corpus enters the
score roughly as a multiplicative factor. So there isn't really a good
scale-independent measure, only a relative one - which is why I've always
just said "gimme the top 0.1% to 1% of ngrams (ordered by LLR)"
out of my set.
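To make the scale dependence concrete, here is a minimal sketch of the standard Dunning log-likelihood ratio for a 2x2 bigram contingency table (the same quantity Mahout's collocation code computes; the function and variable names here are illustrative, not Mahout's API). Note that multiplying every count by a constant multiplies the score by that same constant, which is exactly why absolute thresholds don't transfer across corpus sizes:

```python
import math

def xlogx(x):
    """x * ln(x), with the 0 * ln(0) = 0 convention."""
    return x * math.log(x) if x > 0 else 0.0

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table.

    k11 = count of (A followed by B)
    k12 = count of (A followed by anything but B)
    k21 = count of (anything but A, followed by B)
    k22 = count of everything else

    LLR = 2 * (sum_k k*ln(k) - sum_rows R*ln(R) - sum_cols C*ln(C) + N*ln(N))
    """
    n = k11 + k12 + k21 + k22
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    mat = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    return 2.0 * (mat + xlogx(n) - rows - cols)

# Exact independence (observed == expected counts) scores ~0:
print(llr(10, 990, 990, 98010))

# Doubling the corpus (all counts) exactly doubles the score:
print(llr(100, 900, 1000, 98000))
print(llr(200, 1800, 2000, 196000))
```

Because LLR = 2 * N * (mutual information of the table, in nats), scaling all counts by c leaves the proportions, and hence the mutual information, unchanged, and scales N by c, so the score is linear in corpus size. That's the argument for ranking by LLR and cutting at a percentile rather than at a fixed score.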

  -jake
