As Ken noted, DF (document frequency) is a reasonable metric for term
selection.  If you're interested in additional discussion and/or a more
sophisticated approach, you might want to look at a paper I wrote on
identifying "informative" terms:

http://people.csail.mit.edu/jrennie/papers/sigir05-informativeness.pdf
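
For the plain DF route, here is a minimal sketch of what the filtering could
look like. It is not taken from the paper above. It assumes each document is
already a list of bigram strings, and the function names, toy documents, and
thresholds (min_df, max_df_ratio) are made up for illustration; you would
tune them against your own corpus.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # set(): count a term once per document
    return df

def select_by_df(df, n_docs, min_df=2, max_df_ratio=0.5):
    """Keep terms that are neither too rare nor too common.

    min_df and max_df_ratio are illustrative thresholds, not recommendations.
    """
    return {t for t, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}

# Toy corpus: each document is a list of bigrams (hypothetical data).
docs = [
    ["new york", "york city", "city hall"],
    ["new york", "york times"],
    ["of the", "new york", "city hall"],
    ["of the", "in the"],
    ["of the", "york times"],
    ["of the", "in the"],
]

df = document_frequency(docs)
whitelist = select_by_df(df, len(docs))
print(sorted(whitelist))
```

On this toy input "york city" is dropped as too rare (one document) and
"of the" as too common (four of six documents). At 3.7m documents you would
likely compute the counts in a distributed job rather than in memory, but
the selection rule stays the same.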

Cheers,

Jason

On Tue, Feb 16, 2010 at 11:28 AM, Drew Farris <[email protected]> wrote:

> I have a collection of about 800k bigrams from a corpus of 3.7m
> documents that I'm in the process of working with. I'm looking to
> determine an appropriate subset of these to use as features for
> both an ML and an IR application. Specifically I'm considering
> white-listing a subset of these to use as features when building a
> classifier and separately as terms when building an index and doing
> query parsing. As a part of the earlier collocation discussion Ted
> mentioned that tests for over-representation could be used to identify
> dubious members of such a set.
>
> Does anyone have any pointers to discussions of how such a test could
> be implemented?
>
> Thanks,
>
> Drew
>



-- 
Jason Rennie
Research Scientist, ITA Software
617-714-2645
http://www.itasoftware.com/
