Since you're building both a classifier and a search index, I'm guessing you have at least some example docs to train the classifier on, right? If you have an n-way classifier in which one of the classes is "other/unclassified", you could look for ngrams that are overrepresented in the union of the classes that aren't "other" (i.e., ngrams that are representative of some useful class). These ngrams could form your whitelist.
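Here is a rough sketch of what I mean, in Python rather than anything Mahout-specific. It assumes the training docs are available as (label, tokens) pairs, and it scores each ngram with the same kind of LLR statistic (Dunning's log-likelihood ratio) by comparing its count in the non-"other" classes against its count in "other". The function names, the "other" label, and the min_llr cutoff are all made up for illustration; the cutoff in particular would need tuning on a real corpus.

```python
import math
from collections import Counter


def _xlogx(x):
    # Convention: 0 * log(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return _xlogx(sum(counts)) - sum(_xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_whitelist(docs, other_label="other", n=2, min_llr=10.0):
    """docs: iterable of (label, tokens). Returns {ngram: llr_score}.

    Hypothetical helper: keeps ngrams that are overrepresented in the
    union of all classes except `other_label`.
    """
    useful, other = Counter(), Counter()
    for label, tokens in docs:
        target = other if label == other_label else useful
        target.update(ngrams(tokens, n))

    useful_total = sum(useful.values())
    other_total = sum(other.values())

    whitelist = {}
    for gram, k11 in useful.items():
        k21 = other[gram]            # occurrences in "other"
        k12 = useful_total - k11     # all other ngrams in useful classes
        k22 = other_total - k21      # all other ngrams in "other"
        # LLR is two-sided, so keep only ngrams whose relative frequency
        # is higher in the useful classes than in "other".
        if k11 * other_total <= k21 * useful_total:
            continue
        score = llr(k11, k12, k21, k22)
        if score >= min_llr:
            whitelist[gram] = score
    return whitelist
```

Something like build_whitelist(labeled_docs, n=2) would then give you the candidate bigrams with their scores, which you could still combine with the min/max DF filters.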
-jake

On Feb 16, 2010 10:23 AM, "Drew Farris" <[email protected]> wrote:

Hi Jake,

Yes, I'm using the LLR score. I was wondering if there is anything else I should be looking at other than LLR and min/max DF. The corpus is large and the list is too big to review by hand, so wondering if there's any sort of additional measure I can use to suggest whether I should consider stopping additional subgrams or something of that nature. Ideally, this would be something that could be rolled back into the existing collocation identifier in Mahout.

Thanks,

Drew

(Thanks also Ken, Jason for the comments and pointers -- DF is highly effective indeed.)

On Tue, Feb 16, 2010 at 1:03 PM, Jake Mannix <[email protected]> wrote:
> Drew,
>
> Did you p...
