Shashikant is correct that LLR becomes increasingly sensitive as corpora
grow. The score scales linearly with the counts involved, so in a large
corpus even a tiny deviation from independence can clear a threshold that
looked strict on a small corpus.
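
To see why, here is a minimal sketch of the usual 2x2 contingency-table
form of the test (Python, with my own variable names; treat it as an
illustration rather than a reference implementation):

    import math

    def denorm_entropy(counts):
        # unnormalized entropy term: sum of k * log(k) over nonzero counts
        return sum(k * math.log(k) for k in counts if k > 0)

    def llr_2x2(k11, k12, k21, k22):
        # Log-likelihood ratio (G^2) for a 2x2 table:
        #   k11 = A and B together, k12 = A without B,
        #   k21 = B without A,      k22 = neither
        n = k11 + k12 + k21 + k22
        rows  = denorm_entropy([k11 + k12, k21 + k22])
        cols  = denorm_entropy([k11 + k21, k12 + k22])
        cells = denorm_entropy([k11, k12, k21, k22])
        return 2.0 * (cells - rows - cols + n * math.log(n))

    # Doubling every count leaves the proportions (the strength of
    # the association) unchanged but exactly doubles the score:
    print(llr_2x2(10, 100, 100, 10000))
    print(llr_2x2(20, 200, 200, 20000))  # twice the value above

So a fixed score threshold that works for a small corpus will pass many
more (and weaker) associations in a large one.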

Whether this is good or bad depends on the application.  If you are using
these phrases as ML features, hundreds of extra ones are probably neutral
to slightly helpful.

If the phrases are intended to be user-visible, some additional filtering
is likely to be required for very large corpora.  That filtering can be
linguistic (sentence and phrase boundary limits) or statistical and
tailored to the particular situation (such as requiring over-representation
in a cluster).

I have used LLR for feature detection and description on fairly large
corpora in the past, but I typically had a test for over-representation in
place that kept dubious phrases from being shown to the user (a sketch of
such a test follows below).
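
Concretely, that test can be as simple as comparing the rate of a phrase
inside a cluster against its rate in the rest of the corpus and keeping
only phrases that lean the right way.  A sketch, reusing llr_2x2 from
above (the argument names and the threshold of 50 are placeholders, not
tuned values):

    def over_represented(k_cluster, n_cluster, k_corpus, n_corpus,
                         threshold=50.0):
        # k_cluster: phrase occurrences inside the cluster
        # n_cluster: total occurrences counted in the cluster
        # k_corpus:  phrase occurrences in the whole corpus
        # n_corpus:  total occurrences in the whole corpus
        k11 = k_cluster
        k12 = n_cluster - k_cluster
        k21 = k_corpus - k_cluster           # phrase outside the cluster
        k22 = (n_corpus - n_cluster) - k21   # everything else outside
        # Direction check: keep the phrase only if it is *more* common
        # inside the cluster than outside, i.e.
        # k11 / (k11 + k12) > k21 / (k21 + k22).
        if k11 * (k21 + k22) <= k21 * (k11 + k12):
            return False
        return llr_2x2(k11, k12, k21, k22) > threshold

The direction check matters because LLR by itself is symmetric: a phrase
that is conspicuously absent from a cluster scores just as high as one
that is conspicuously present, and only the latter is worth showing to a
user.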

On Fri, Jan 8, 2010 at 4:44 AM, Shashikant Kore <[email protected]> wrote:

> I don't think the absolute value of the LLR score is an indicator of
> the importance of a term across all datasets.
>
> With a corpus of a million documents, if I calculate LLR scores for the
> terms in a set of, say, 50,000 documents, I get hundreds of terms with
> scores above 50, many of which are not "useful."
>
> Ted, can you please comment on Robin's observation?
>



-- 
Ted Dunning, CTO
DeepDyve
