On Feb 16, 2010, at 8:28am, Drew Farris wrote:

I have a collection of about 800k bigrams from a corpus of 3.7m
documents that I'm in the process of working with. I'm looking to
determine an appropriate subset of these to use as features for both
an ML and an IR application. Specifically, I'm considering
white-listing a subset of these to use as features when building a
classifier, and separately as terms when building an index and doing
query parsing. As part of the earlier collocation discussion, Ted
mentioned that tests for over-representation could be used to identify
dubious members of such a set.

Does anyone have any pointers to discussions of how such a test could
be implemented?
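
A minimal sketch of one way such a test could look, assuming you frame it
as a 2x2 contingency table comparing a bigram's document counts in the
target corpus against some reference corpus, and score it with the G^2
log-likelihood ratio (the statistic behind the collocation work). The class
name and the counts below are hypothetical, not anything from this thread:

// Sketch: G^2 log-likelihood ratio over a 2x2 contingency table.
//   k11 = docs containing the bigram in the target corpus
//   k12 = other docs in the target corpus
//   k21 = docs containing the bigram in a reference corpus
//   k22 = other docs in the reference corpus
public final class OverRepresentation {

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // "Entropy" in the unnormalized sum-of-x-log-x sense used for G^2.
  private static double entropy(long... counts) {
    long sum = 0;
    double xlx = 0.0;
    for (long c : counts) {
      sum += c;
      xlx += xLogX(c);
    }
    return xLogX(sum) - xlx;
  }

  // G^2 = 2 * (rowEntropy + colEntropy - matrixEntropy).  Large values mean
  // the bigram's frequency in the target corpus departs strongly from what
  // the reference corpus would predict.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + colEntropy < matrixEntropy) {
      return 0.0; // guard against round-off
    }
    return 2.0 * (rowEntropy + colEntropy - matrixEntropy);
  }

  public static void main(String[] args) {
    // Hypothetical counts: a bigram in 3,000 of 3.7M target docs
    // vs. 200 of 3.7M reference docs.
    System.out.println(logLikelihoodRatio(3_000, 3_697_000, 200, 3_699_800));
  }
}

Bigrams with a very large score are candidates for a closer look; whether
they belong on the white-list or the dubious list depends on which direction
the imbalance runs.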

Wouldn't simple df (document frequency) be a reasonable metric for this?

From what I've seen in Lucene indexes, a ranked list of terms (by df) has a pretty sharp elbow that you could use as the cut-off point.
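
A minimal sketch of that cut-off idea, assuming the per-bigram df counts
have already been collected into a map (however you get them out of the
index). The "largest relative drop" rule is just one simple way to locate
the elbow, and the class and method names are made up for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: rank bigrams by df and cut where the ranked curve drops most
// sharply, keeping everything above the elbow.
public final class DfCutoff {

  public static List<String> keepAboveElbow(Map<String, Integer> dfByBigram) {
    List<Map.Entry<String, Integer>> ranked = new ArrayList<>(dfByBigram.entrySet());
    // Sort descending by document frequency.
    ranked.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

    // Find the rank where df falls off fastest relative to its neighbor.
    int elbow = ranked.size();
    double biggestDrop = 0.0;
    for (int i = 1; i < ranked.size(); i++) {
      double prev = ranked.get(i - 1).getValue();
      double curr = ranked.get(i).getValue();
      double drop = (prev - curr) / prev;
      if (drop > biggestDrop) {
        biggestDrop = drop;
        elbow = i;
      }
    }

    List<String> kept = new ArrayList<>();
    for (int i = 0; i < elbow; i++) {
      kept.add(ranked.get(i).getKey());
    }
    return kept;
  }
}

In practice you'd probably eyeball the ranked curve first and pick the
cut-off by hand, since df distributions are heavy-tailed and a purely
automatic elbow rule can land in an odd spot.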

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



