Re: n-gram over-representation?

Jake Mannix Tue, 16 Feb 2010 10:52:31 -0800

Since you're building both a classifier and a search index, I'm guessing
you have at least some example docs to train the classifier on, right?
If you have an n-way classifier in which one of the classes is
"other/unclassified", then you could look for ngrams which are
overrepresented in the union of the classes which aren't "other" (i.e. these
ngrams are representative of some useful class). These ngrams could form
your whitelist.
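
A rough sketch of what I mean, in Python rather than anything
Mahout-specific. Everything here is an assumption for illustration: the
function names, the (label, ngrams) input shape, the "other" label and the
min_llr cutoff are all made up, and the LLR is the usual 2x2 contingency
form computed from raw counts, not a call into any existing API.

```python
import math
from collections import Counter


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 contingency table.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def build_whitelist(docs, other_label="other", min_llr=10.0):
    """docs: iterable of (label, list_of_ngrams) pairs (hypothetical shape).

    Returns {ngram: llr_score} for ngrams over-represented in the union
    of all classes that are not `other_label`.
    """
    in_counts, out_counts = Counter(), Counter()
    for label, ngrams in docs:
        target = out_counts if label == other_label else in_counts
        target.update(ngrams)

    in_total = sum(in_counts.values())
    out_total = sum(out_counts.values())

    whitelist = {}
    for ngram, k11 in in_counts.items():
        k12 = out_counts[ngram]   # occurrences in "other"
        k21 = in_total - k11      # all other ngrams in the useful classes
        k22 = out_total - k12     # all other ngrams in "other"
        # LLR is symmetric, so keep only ngrams whose rate is higher in
        # the useful classes than in "other" (compared without division).
        if k11 * out_total > k12 * in_total:
            score = llr(k11, k12, k21, k22)
            if score >= min_llr:
                whitelist[ngram] = score
    return whitelist
```

The min_llr cutoff would need tuning against your corpus; the default
above is just a placeholder.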


  -jake

On Feb 16, 2010 10:23 AM, "Drew Farris" <[email protected]> wrote:

Hi Jake,

Yes, I'm using the LLR score. I was wondering if there is anything
else I should be looking at other than LLR and min/max DF. The corpus
is large and the list is too big to review by hand, so I'm wondering if
there's any sort of additional measure I can use to suggest whether I
should consider stopping additional subgrams or something of that
nature.

Ideally, this would be something that could be rolled back into the
existing collocation identifier in Mahout.

Thanks,

Drew

(Thanks also Ken, Jason for the comments and pointers -- DF is highly
effective indeed.)

On Tue, Feb 16, 2010 at 1:03 PM, Jake Mannix <[email protected]> wrote:
> Drew,
>
>  Did you p...
