More info:

Classifying docs (same train/test set) as "Republicans" or "Democrats" yields:
     [java] Summary
     [java] -------------------------------------------------------
     [java] Correctly Classified Instances          :         56        76.7123%
     [java] Incorrectly Classified Instances        :         17        23.2877%
     [java] Total Classified Instances              :         73
     [java]
     [java] =======================================================
     [java] Confusion Matrix
     [java] -------------------------------------------------------
     [java] a           b       <--Classified as
     [java] 21          9        |  30          a     = democrats
     [java] 8           35       |  43          b     = republicans
     [java] Default Category: unknown: 2
     [java]
     [java]

For these, the training data was roughly equal in size (both about 1.5MB), and on the test set I got about 81% right for Republicans and 70% for the Democrats (does this imply Repubs do a better job of sticking to the message on Wikipedia than Dems? :-) It would be interesting to train on a larger set.
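
For anyone checking the arithmetic, those per-class numbers fall straight out of the confusion matrix rows above; a quick sketch:

    // Per-class accuracy (recall), using only the counts from the run above.
    public class PerClassRecall {
      public static void main(String[] args) {
        int demCorrect = 21, demTotal = 30;   // row "a = democrats"
        int repCorrect = 35, repTotal = 43;   // row "b = republicans"
        System.out.printf("democrats:   %.1f%%%n", 100.0 * demCorrect / demTotal); // 70.0%
        System.out.printf("republicans: %.1f%%%n", 100.0 * repCorrect / repTotal); // 81.4%
      }
    }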

-Grant

On Jul 22, 2009, at 9:50 PM, Robin Anil wrote:

Did you try CBayes? It's supposed to negate the class imbalance effect
to some extent.
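
Roughly, complementary naive Bayes scores each class using term statistics from all the *other* classes, so a large majority class can't swamp the per-class estimates. A minimal sketch of that core idea (not Mahout's actual CBayes code; the smoothing and data structures here are illustrative):

    import java.util.*;

    public class ComplementNBSketch {
      // termCounts.get(class).get(term) = raw term count in that class's training docs.
      // Returns per-term weights for `clazz` built from the counts of every other class;
      // terms that are rare in the complement get high weight for this class.
      static Map<String, Double> complementWeights(
          Map<String, Map<String, Integer>> termCounts, String clazz, Set<String> vocab) {
        double alpha = 1.0;                         // Laplace smoothing
        Map<String, Integer> comp = new HashMap<>();
        int compTotal = 0;
        for (Map.Entry<String, Map<String, Integer>> e : termCounts.entrySet()) {
          if (e.getKey().equals(clazz)) continue;   // skip the class itself
          for (Map.Entry<String, Integer> t : e.getValue().entrySet()) {
            comp.merge(t.getKey(), t.getValue(), Integer::sum);
            compTotal += t.getValue();
          }
        }
        Map<String, Double> w = new HashMap<>();
        for (String term : vocab) {
          double p = (comp.getOrDefault(term, 0) + alpha) / (compTotal + alpha * vocab.size());
          w.put(term, -Math.log(p));                // low complement probability => high weight
        }
        return w;
      }
    }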



On Thu, Jul 23, 2009 at 5:02 AM, Ted Dunning<[email protected]> wrote:
Some learning algorithms deal with this better than others. The problem is particularly bad in information retrieval (negative examples include almost the entire corpus, positives are a tiny fraction) and fraud (less than 1% of
the training data is typically fraud).

Down-sampling the over-represented class is the simplest answer when you have lots of data. It doesn't help much to have more than 3x more data for
one class than for the other anyway (at least in binary decisions).
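
A minimal sketch of that kind of down-sampling (generic over whatever document type you use): keep every minority example and a random sample of at most 3x that many majority examples.

    import java.util.*;

    public class DownSample {
      // Keep every minority example; randomly keep at most 3x that many majority examples.
      static <T> List<T> downSample(List<T> minority, List<T> majority, long seed) {
        List<T> kept = new ArrayList<>(minority);
        List<T> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, new Random(seed));
        kept.addAll(shuffled.subList(0, Math.min(shuffled.size(), 3 * minority.size())));
        Collections.shuffle(kept, new Random(seed)); // mix classes before training
        return kept;
      }
    }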

Another aspect of this is the cost of different errors. For instance, in
fraud, verifying a transaction with a customer has a low (but non-zero)
cost, while not detecting a fraud in progress can be very, very bad. False negatives are thus more of a problem than false positives, and the
models are tuned accordingly.
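
In other words, instead of thresholding the score at 0.5, you pick the decision that minimizes expected cost. A tiny sketch with made-up costs (the actual numbers would come from the business, not from the model):

    public class CostSensitive {
      // Flag a transaction when the expected cost of missing fraud exceeds
      // the expected cost of bothering the customer.
      static boolean flagAsFraud(double pFraud, double costFalsePositive, double costFalseNegative) {
        return pFraud * costFalseNegative > (1.0 - pFraud) * costFalsePositive;
      }

      public static void main(String[] args) {
        // e.g. a verification call costs ~$5, a missed fraud ~$500 (illustrative numbers)
        System.out.println(flagAsFraud(0.02, 5.0, 500.0)); // true: flag even at 2% probability
      }
    }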

On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <[email protected]> wrote:

this is the class imbalance problem (i.e. you have many more instances of
one class than of the other).

in this case, you could ensure that the training set was balanced (50:50); more interestingly, you could use a prior which corrects for this. or, you
could over-sample or under-sample the training set, etc.
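
A minimal sketch of the prior-correction idea: leave the likelihoods alone and swap the empirical class prior for the one you actually believe (uniform, or the rate expected at deployment time). The numbers and naming here are illustrative only.

    public class PriorCorrection {
      // If a classifier's score already folds in the empirical class prior from a
      // skewed training set, replace that prior with the one you actually believe.
      static double correctPrior(double scoreWithTrainingPrior,
                                 double trainingPrior, double believedPrior) {
        return scoreWithTrainingPrior - Math.log(trainingPrior) + Math.log(believedPrior);
      }

      public static void main(String[] args) {
        // e.g. class seen in 90% of training docs but believed to be 50% in the wild
        System.out.println(correctPrior(-12.3, 0.9, 0.5)); // illustrative numbers
      }
    }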




--
Ted Dunning, CTO
DeepDyve


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
