Shashikant is correct that LLR becomes more and more sensitive as the corpus grows: for a fixed relative difference in rates, the score scales with the total count, so a bigger corpus pushes more phrases over any fixed threshold.
Whether this is good or bad depends on the use. If you are using these as ML features, hundreds of extra phrases are probably neutral to slightly helpful. If the phrases are intended to be user-visible, some additional filtering is likely to be required for very large corpora. That filtering can be linguistic (sentence and phrase boundary limits) or statistical, tailored to the particular situation (such as a test for over-representation in a cluster). I have used LLR for feature detection and description on fairly large corpora in the past, but typically had an over-representation test in place that prevented dubious phrases from being presented to the user.

On Fri, Jan 8, 2010 at 4:44 AM, Shashikant Kore <[email protected]> wrote:

> I don't think the absolute value of the LLR score is an indicator of
> the importance of a term across the whole dataset.
>
> With a corpus of a million documents, if I calculate LLR scores for terms
> in a set of, say, 50,000 documents, I get hundreds of terms with scores
> above 50, many of which are not "useful."
>
> Ted, can you please comment on Robin's observation?

--
Ted Dunning, CTO DeepDyve
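For concreteness, here is a minimal sketch (not from the original message; class and method names are illustrative) of the root computation being discussed: Dunning's G² log-likelihood ratio for a 2x2 contingency table of counts. It also makes the sensitivity point visible in code: the score is 2·N·I(row; column), so the same relative over-representation scores higher in a bigger corpus.

```java
// Minimal sketch of Dunning's G^2 (log-likelihood ratio) for a 2x2
// contingency table of counts:
//   k11 = occurrences of the phrase in the subset (e.g., a cluster)
//   k12 = occurrences of other phrases in the subset
//   k21 = occurrences of the phrase in the rest of the corpus
//   k22 = occurrences of other phrases in the rest of the corpus
public final class Llr {

  private static double xLogX(double x) {
    return x == 0.0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy: N log N - sum_i k_i log k_i = N * H.
  private static double entropy(double... counts) {
    double sum = 0.0;
    double sumXLogX = 0.0;
    for (double k : counts) {
      sum += k;
      sumXLogX += xLogX(k);
    }
    return xLogX(sum) - sumXLogX;
  }

  // LLR = 2 * N * I(row; column). The factor of N is why scores grow
  // with corpus size even when the relative difference stays the same.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + colEntropy < matEntropy) {
      return 0.0; // guard against round-off producing a tiny negative value
    }
    return 2.0 * (rowEntropy + colEntropy - matEntropy);
  }
}
```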

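And a sketch, under the same caveats, of the kind of over-representation test described in the reply (the method name and threshold parameter are hypothetical): a phrase is surfaced only when its rate in the subset actually exceeds its rate in the rest of the corpus, since LLR alone also scores highly for phrases that are unusually *under*-represented.

```java
// Hypothetical filter in the spirit of the over-representation test
// described above. Uses the Llr sketch from the previous block.
public final class PhraseFilter {

  public static boolean keepPhrase(long inSubset, long subsetTotal,
                                   long inRest, long restTotal,
                                   double minLlr) {
    double subsetRate = (double) inSubset / subsetTotal;
    double restRate = (double) inRest / restTotal;
    if (subsetRate <= restRate) {
      return false; // high LLR from under-representation: not worth showing
    }
    double llr = Llr.logLikelihoodRatio(
        inSubset, subsetTotal - inSubset,
        inRest, restTotal - inRest);
    return llr >= minLlr;
  }
}
```

Note that raising `minLlr` alone does not address the problem in the quoted message, because scores grow with corpus size; it is the directional check that removes most of the "not useful" phrases.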