As Ken noted, DF is a reasonable metric for term selection. If you'd like additional discussion and/or a more sophisticated approach, you might be interested in a paper I wrote on the topic of identifying "informative" terms:
http://people.csail.mit.edu/jrennie/papers/sigir05-informativeness.pdf

Cheers,

Jason

On Tue, Feb 16, 2010 at 11:28 AM, Drew Farris <[email protected]> wrote:

> I have a collection of about 800k bigrams from a corpus of 3.7m
> documents that I'm in the process of working with. I'm looking to
> determine an appropriate subset of these to use both as features for
> both an ML and an IR application. Specifically I'm considering
> white-listing a subset of these to use as features when building a
> classifier and separately as terms when building an index and doing
> query parsing. As a part of the earlier collocation discussion Ted
> mentioned that tests for over-representation could be used to identify
> dubious members of such a set.
>
> Does anyone have any pointers to discussions of how such a test could
> be implemented?
>
> Thanks,
>
> Drew
>

--
Jason Rennie
Research Scientist, ITA Software
617-714-2645
http://www.itasoftware.com/
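
For illustration, the DF-based selection mentioned at the top of this message could be sketched roughly as below. This is only a minimal sketch and is not taken from the thread or from the linked paper: the function name `select_by_df` and the thresholds `min_df` and `max_df_ratio` are hypothetical, and it assumes each document has already been reduced to a list of bigram strings.

```python
from collections import Counter


def select_by_df(docs, min_df=2, max_df_ratio=0.5):
    """Return the set of terms whose document frequency (DF) is in range.

    docs: iterable of per-document term lists (e.g. bigram strings).
    min_df and max_df_ratio are hypothetical cutoffs; tune them per corpus.
    """
    docs = list(docs)
    df = Counter()
    for terms in docs:
        # Count each term at most once per document: DF, not raw frequency.
        df.update(set(terms))
    n = len(docs)
    return {
        term
        for term, count in df.items()
        if count >= min_df and count / n <= max_df_ratio
    }


docs = [
    ["a b", "b c"],
    ["a b", "c d"],
    ["a b", "b c"],
    ["x y"],
]
whitelist = select_by_df(docs, min_df=2, max_df_ratio=0.5)
```

With these toy documents, "a b" is dropped for appearing in too many documents and "c d" and "x y" for appearing in too few, so only "b c" survives. Whether an upper cutoff is appropriate depends on the application; it is included here only as one possible design choice.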
