Hi Ted,

Thanks for the response. To answer your questions:

1. I have 576 categories.
2. I started with 5 training documents per category. I went up to 10, but error levels remained the same. I am going to go up to 30 documents and also increase the length of the documents.

How did you derive the 50 words of training data for some topics? Curious...

S.
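On the 50-words question, one plausible reading (an assumption, not something Ted states in the thread) is that the figure comes from the 300-character minimum divided by roughly 6 characters per English word, counting the trailing space. The sketch below works through that arithmetic along with the total training set sizes implied by 576 categories. The function names and the 6-characters-per-word figure are illustrative choices, not taken from the original messages.

```python
def estimate_words(num_chars, chars_per_word=6):
    """Rough word count for a text, assuming about 6 characters per word
    (5 letters plus a space). The ratio is an assumption, not a measurement."""
    return num_chars // chars_per_word


def training_set_size(num_categories, docs_per_category):
    """Total number of training documents across all categories."""
    return num_categories * docs_per_category


# Shortest document mentioned in the thread: 300 characters.
print(estimate_words(300))  # 50

# Total training documents for 576 categories at each level tried or planned.
for docs in (5, 10, 30):
    print(docs, training_set_size(576, docs))
```

Under this reading, a category whose documents all sit near the 300-character minimum would have only a few hundred words of training text in total, even at 5 or 10 documents per category.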
----- Original Message -----
From: "Ted Dunning"
To: [email protected]
Subject: Re: Document size rules of thumb
Date: Wed, 7 Oct 2009 10:21:20 -0700

Sandra,

This is a classic case of over-fitting. I suspect training data inadequacy.

One thing you don't say is how many categories you have and how many training documents per category you have. Your point (2) might indicate that you have as little as 50 words of training data for some topics. That would make it difficult for even the best classifiers to get a sharp result.

I would recommend the following:

a) get more training data (always a good thing even if often infeasible)

b) try a few other algorithms. I would recommend trying Luduan (from my dissertation, pdf sent to you in a separate email), confidence weighted learning (see http://www.cs.jhu.edu/~mdredze/publications/, especially http://www.aclweb.org/anthology-new/D/D09/D09-1052.pdf) and vowpal (http://hunch.net/~vw/)

c) post your data for others to try

Hope this helps.

On Wed, Oct 7, 2009 at 9:37 AM, Sandra Clover wrote:

> 0. The setup is Mahout 0.1 & Hadoop 0.19.2 – I think I am using a
> branch version. Currently trying to install the trunk version
>
> 1. The data I am trying to classify is from scientific papers -
> essentially the abstract title, text and keywords of their paper -
> example below
>
> 2. No data source is under 300 characters
>
> 3. I am training using the Mahout naive Bayes and am getting low
> incorrectly classified rates something like: 1.67% - I’m quite happy
> with that…
>
> 4. After I have trained the model Robin I use the Mahout naive Bayes
> classify() method to classify new (unseen) data (with the classification
> already known) - this is where I start to get problems - I get very poor
> successful classification rates for new data. Something like: 82%
> unsuccessfully classified.
>
> To Summarise: I get very good results in training and very poor results
> with new data.
>

--
Ted Dunning, CTO
DeepDyve
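The pattern discussed in this thread (very low error on the training documents, very high error on unseen ones) can be reproduced on synthetic data. The following is a minimal sketch, not Mahout code and not Sandra's data: it implements a small multinomial naive Bayes with add-one smoothing in plain Python, trains it on a few short documents per category, and compares accuracy on the training set with accuracy on held-out documents. Every constant and function name here is hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter

NUM_CATEGORIES = 20  # hypothetical; far fewer than the 576 in the thread
VOCAB_SIZE = 500     # hypothetical vocabulary size
DOC_LEN = 50         # words per document, echoing the "50 words" estimate
TOPIC_WORDS = 10     # words that are more likely within each category
BOOST = 5.0          # how much more likely a topic word is than any other


def make_doc(rng, category):
    """Draw one synthetic document as a list of word ids."""
    weights = [1.0] * VOCAB_SIZE
    start = category * TOPIC_WORDS
    for w in range(start, start + TOPIC_WORDS):
        weights[w] = BOOST
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=DOC_LEN)


def make_dataset(rng, docs_per_category):
    """Build a balanced labelled dataset."""
    docs, labels = [], []
    for c in range(NUM_CATEGORIES):
        for _ in range(docs_per_category):
            docs.append(make_doc(rng, c))
            labels.append(c)
    return docs, labels


def train(docs, labels):
    """Count word occurrences per category."""
    counts = [Counter() for _ in range(NUM_CATEGORIES)]
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    totals = [sum(c.values()) for c in counts]
    return counts, totals


def classify(model, doc):
    """Return the category with the highest smoothed log-likelihood.
    Priors are omitted because the categories are balanced."""
    counts, totals = model

    def score(c):
        denom = totals[c] + VOCAB_SIZE  # add-one (Laplace) smoothing
        return sum(math.log((counts[c][w] + 1) / denom) for w in doc)

    return max(range(NUM_CATEGORIES), key=score)


def accuracy(model, docs, labels):
    hits = sum(classify(model, d) == y for d, y in zip(docs, labels))
    return hits / len(docs)


def run_experiment(docs_per_category, seed=0):
    """Return (training accuracy, held-out accuracy)."""
    rng = random.Random(seed)
    train_docs, train_labels = make_dataset(rng, docs_per_category)
    test_docs, test_labels = make_dataset(rng, 10)  # never used for training
    model = train(train_docs, train_labels)
    return (accuracy(model, train_docs, train_labels),
            accuracy(model, test_docs, test_labels))


for n in (5, 30):
    train_acc, test_acc = run_experiment(n)
    print(f"{n:2d} docs/category: train={train_acc:.2%} held-out={test_acc:.2%}")
```

A held-out figure far below the training figure is the over-fitting signature described above: the classifier has largely memorised its few training documents. Comparing the two runs gives a rough sense of how much more training data helps under these synthetic assumptions, which may not carry over to real abstracts.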
