Re: Document size rules of thumb

Sandra Clover Thu, 08 Oct 2009 01:04:21 -0700

Hi Ted,

Thanks for the response. To answer your questions:

1. I have 576 categories.
2. I started with 5 training documents per category. I went up to 10,
   but error levels remained the same. I am going to go up to 30
   documents and am going to increase the length of the documents.

How did you derive the 50 words of training data for some topics?
Curious...

S.
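
A rough sketch of how a figure like 50 words can follow from point (2) of
the original message (no data source under 300 characters). The average
of about 6 characters per word, counting the trailing space, is an
assumption made here for illustration; it is not stated anywhere in the
thread, and scientific abstracts may well run longer per word.

```python
# Back-of-the-envelope estimate of training data volume.
MIN_CHARS_PER_DOC = 300   # from the thread: no data source is under 300 characters
CHARS_PER_WORD = 6        # ASSUMPTION: ~5 letters plus one space per English word
NUM_CATEGORIES = 576      # from the thread
DOCS_PER_CATEGORY = 5     # from the thread: initial training documents per category


def words_per_doc(chars, chars_per_word=CHARS_PER_WORD):
    """Approximate word count of a document with the given character count."""
    return chars // chars_per_word


# A category represented by a single shortest-possible document.
min_words_single_doc = words_per_doc(MIN_CHARS_PER_DOC)

# Worst case with the initial 5 documents per category.
min_words_per_category = min_words_single_doc * DOCS_PER_CATEGORY

# Total number of training documents across all categories.
total_docs = NUM_CATEGORIES * DOCS_PER_CATEGORY

print(min_words_single_doc, min_words_per_category, total_docs)
```

Under these assumptions even the 5-document setup leaves only a few
hundred words for the smallest categories, spread over 576 classes.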

  ----- Original Message -----
  From: "Ted Dunning"
  To: [email protected]
  Subject: Re: Document size rules of thumb
  Date: Wed, 7 Oct 2009 10:21:20 -0700

  Sandra,

  This is a classic case of over-fitting. I suspect training data
  inadequacy. One thing you don't say is how many categories you have
  and how many training documents per category you have. Your point (2)
  might indicate that you have as little as 50 words of training data
  for some topics. That would make it difficult for even the best
  classifiers to get a sharp result.

  I would recommend the following:

  a) get more training data (always a good thing, even if often
  infeasible)

  b) try a few other algorithms. I would recommend trying Luduan (from
  my dissertation, pdf sent to you in a separate email), confidence
  weighted learning (see http://www.cs.jhu.edu/~mdredze/publications/,
  especially http://www.aclweb.org/anthology-new/D/D09/D09-1052.pdf)
  and Vowpal Wabbit (http://hunch.net/~vw/)

  c) post your data for others to try

  Hope this helps.

  On Wed, Oct 7, 2009 at 9:37 AM, Sandra Clover wrote:

  > 0. The setup is Mahout 0.1 & Hadoop 0.19.2 – I think I am using a
  > branch version. Currently trying to install the trunk version
  >
  > 1. The data I am trying to classify is from scientific papers -
  > essentially the abstract title, text and keywords of their paper -
  > example below
  >
  > 2. No data source is under 300 characters
  >
  > 3. I am training using the Mahout naive Bayes and am getting a low
  > incorrectly classified rate, something like 1.67% - I’m quite happy
  > with that…
  >
  > 4. After I have trained the model Robin I use the Mahout naive Bayes
  > classify() method to classify new (unseen) data (with the
  > classification already known) - this is where I start to get
  > problems - I get very poor successful classification rates for new
  > data. Something like: 82% unsuccessfully classified.
  >
  > To summarise: I get very good results in training and very poor
  > results with new data.
  >

  --
  Ted Dunning, CTO
  DeepDyve
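
The over-fitting diagnosis in the reply above rests on the gap between
the error on the training data and the error on new, labelled data. A
minimal sketch of that comparison, using the two figures quoted in the
thread (1.67% and 82%); the function names are hypothetical helpers and
not part of the Mahout API:

```python
def error_rate(predicted, actual):
    """Fraction of examples whose predicted label differs from the known label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one example")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def generalization_gap(train_error, heldout_error):
    """Held-out error minus training error; a large positive gap suggests over-fitting."""
    return heldout_error - train_error


# Figures reported in the thread.
train_error = 0.0167    # 1.67% incorrectly classified on the training data
heldout_error = 0.82    # 82% incorrectly classified on new data

gap = generalization_gap(train_error, heldout_error)
print(f"train={train_error:.4f} heldout={heldout_error:.4f} gap={gap:.4f}")
```

In practice `error_rate` would be run twice, once on the training
documents and once on documents held out from training, so that the
effect of adding more documents per category shows up as a shrinking gap.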
