Hi Jake,

Yes, I'm using the LLR score. I was wondering whether there is anything else I should be looking at besides LLR and min/max DF. The corpus is large and the list is too big to review by hand, so I'm wondering whether there's some additional measure I could use to suggest whether I should stop collecting additional subgrams, or something of that nature.
Ideally, this would be something that could be rolled back into the existing collocation identifier in Mahout.

Thanks,
Drew

(Thanks also to Ken and Jason for the comments and pointers -- DF is highly effective indeed.)

On Tue, Feb 16, 2010 at 1:03 PM, Jake Mannix <[email protected]> wrote:
> Drew,
>
> Did you pick your whitelist using the LLR score? What kind of
> over-representation are you trying to prune out? DF will certainly help you
> remove "too common" bigrams, but that's not what you're looking for, is it?
>
> -jake
>
> On Feb 16, 2010 8:29 AM, "Drew Farris" <[email protected]> wrote:
>
> I have a collection of about 800k bigrams from a corpus of 3.7m
> documents that I'm in the process of working with. I'm looking to
> determine an appropriate subset of these to use as features for
> both an ML and an IR application. Specifically, I'm considering
> white-listing a subset of these to use as features when building a
> classifier, and separately as terms when building an index and doing
> query parsing. As part of the earlier collocation discussion, Ted
> mentioned that tests for over-representation could be used to identify
> dubious members of such a set.
>
> Does anyone have any pointers to discussions of how such a test could
> be implemented?
>
> Thanks,
>
> Drew
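For anyone following the thread, the over-representation test being discussed is Dunning's log-likelihood ratio (G²) on a 2x2 contingency table of bigram counts, which is what Mahout's LLR scoring is based on. Below is a minimal standalone sketch of that statistic in Python; the function name, argument layout, and counts are illustrative, not Mahout's exact API:

```python
import math


def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    For a bigram "A B":
      k11 -- count of "A B"
      k12 -- count of A followed by something other than B
      k21 -- count of B preceded by something other than A
      k22 -- count of all other bigrams in the corpus
    Large values suggest A and B co-occur far more often than
    independence would predict.
    """
    def x_log_x(x):
        # Convention: 0 * log(0) == 0
        return x * math.log(x) if x > 0 else 0.0

    total = k11 + k12 + k21 + k22
    # G^2 = 2 * sum O * ln(O / E), expanded into entropy-style terms
    # so no explicit expected-count division is needed.
    row_sum = x_log_x(k11 + k12) + x_log_x(k21 + k22)
    col_sum = x_log_x(k11 + k21) + x_log_x(k12 + k22)
    matrix = x_log_x(k11) + x_log_x(k12) + x_log_x(k21) + x_log_x(k22)
    return 2.0 * (matrix + x_log_x(total) - row_sum - col_sum)


# A table that exactly matches independence scores ~0;
# a strongly associated bigram scores high.
print(llr(10, 90, 90, 810))    # counts consistent with independence
print(llr(100, 10, 10, 880))   # heavily over-represented bigram
```

One practical note: since G² is asymptotically chi-squared with one degree of freedom, a score threshold can be read as a significance level, but with 800k candidates a multiple-testing correction (or simply ranking and cutting by score, as Mahout does) is the more common approach.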
