More info:
Classifying docs (same train/test set) as "Republicans" or "Democrats"
yields:
[java] Summary
[java] -------------------------------------------------------
     [java] Correctly Classified Instances   :   56    76.7123%
     [java] Incorrectly Classified Instances :   17    23.2877%
[java] Total Classified Instances : 73
[java]
[java] =======================================================
[java] Confusion Matrix
[java] -------------------------------------------------------
[java] a b <--Classified as
[java] 21 9 | 30 a = democrats
[java] 8 35 | 43 b = republicans
[java] Default Category: unknown: 2
[java]
[java]
For these, the training data was roughly equal in size (both about
1.5MB), and on the test I got about 81% right for the Republicans
(35/43 in the confusion matrix above) and 70% for the Democrats
(21/30). Does this imply Republicans do a better job of sticking to
message on Wikipedia than Democrats? :-) It would be interesting to
train on a larger set.
-Grant
On Jul 22, 2009, at 9:50 PM, Robin Anil wrote:
Did you try CBayes? It's supposed to negate the class imbalance effect
to some extent.
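
(For anyone curious what CBayes changes: below is a rough, self-contained
sketch of the complement-weighting idea it is based on, i.e. Rennie et al.'s
Complement Naive Bayes. The class names and counts are made up, and this is
only an illustration of the technique, not Mahout's actual code or API.)

/** Toy sketch of Complement Naive Bayes (the idea behind CBayes).
 *  Weights for class c are estimated from all classes EXCEPT c, so a
 *  class with few documents is not penalized by its own sparse counts. */
public class CBayesSketch {

  public static void main(String[] args) {
    // termCounts[class][term]: raw term counts per class (toy numbers).
    double[][] termCounts = {
        {40, 5, 1},   // class 0, e.g. "democrats"
        {90, 2, 30}   // class 1, e.g. "republicans" (the bigger class)
    };
    double alpha = 1.0;                       // Laplace smoothing
    int numClasses = termCounts.length;
    int numTerms = termCounts[0].length;

    // Complement weights: w[c][i] = log P(term i | NOT class c).
    double[][] w = new double[numClasses][numTerms];
    for (int c = 0; c < numClasses; c++) {
      double totalNotC = 0;
      for (int other = 0; other < numClasses; other++) {
        if (other == c) continue;
        for (int i = 0; i < numTerms; i++) totalNotC += termCounts[other][i];
      }
      for (int i = 0; i < numTerms; i++) {
        double countNotC = 0;
        for (int other = 0; other < numClasses; other++) {
          if (other == c) continue;
          countNotC += termCounts[other][i];
        }
        w[c][i] = Math.log((countNotC + alpha) / (totalNotC + alpha * numTerms));
      }
    }

    // Classify a document by picking the class whose COMPLEMENT fits worst,
    // i.e. the smallest sum of termFreq * complementWeight.
    double[] doc = {3, 1, 0};                 // term frequencies of a test doc
    int best = -1;
    double bestScore = Double.POSITIVE_INFINITY;
    for (int c = 0; c < numClasses; c++) {
      double score = 0;
      for (int i = 0; i < numTerms; i++) score += doc[i] * w[c][i];
      if (score < bestScore) { bestScore = score; best = c; }
    }
    System.out.println("predicted class: " + best);
  }
}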
On Thu, Jul 23, 2009 at 5:02 AM, Ted Dunning<[email protected]>
wrote:
Some learning algorithms deal with this better than others. The problem
is particularly bad in information retrieval (negative examples include
almost the entire corpus, positives are a tiny fraction) and in fraud
detection (less than 1% of the training data is typically fraud).

Down-sampling the over-represented case is the simplest answer where you
have lots of data. It doesn't help much to have more than 3x more data
for one case than another anyway (at least in binary decisions).
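
(A minimal sketch of what down-sampling with a ~3x cap might look like; the
helper name, the fraud/legit labels and the counts are just illustrative.)

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Toy sketch of down-sampling the over-represented class so that it is
 *  at most ~3x the size of the under-represented one. */
public class DownSampleSketch {

  public static <T> List<T> downSample(List<T> majority, int minoritySize,
                                       double maxRatio, Random rng) {
    int keep = (int) Math.min(majority.size(), maxRatio * minoritySize);
    List<T> copy = new ArrayList<>(majority);
    Collections.shuffle(copy, rng);          // take a random subset, not the first N
    return copy.subList(0, keep);
  }

  public static void main(String[] args) {
    List<String> fraud = new ArrayList<>();      // rare positives
    List<String> legit = new ArrayList<>();      // abundant negatives
    for (int i = 0; i < 100; i++) fraud.add("fraud-" + i);
    for (int i = 0; i < 10000; i++) legit.add("legit-" + i);

    List<String> sampledLegit = downSample(legit, fraud.size(), 3.0, new Random(42));
    System.out.println("kept " + sampledLegit.size() + " of " + legit.size()
        + " negatives");                         // kept 300 of 10000 negatives
  }
}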
Another aspect of this is the cost of different errors. For instance, in
fraud, verifying a transaction with a customer has a low (but non-zero)
cost, while not detecting a fraud in progress can be very, very bad.
False negatives are thus more of a problem than false positives, and the
models are tuned accordingly.
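
(A minimal sketch of one way that tuning can work: derive the decision
threshold from the relative costs of the two error types. The cost numbers
are made up.)

/** Toy sketch of cost-sensitive thresholding: flag a transaction whenever the
 *  expected cost of ignoring it exceeds the expected cost of verifying it. */
public class CostThresholdSketch {

  public static void main(String[] args) {
    double costFalsePositive = 5.0;     // bothering a customer with a call
    double costFalseNegative = 500.0;   // letting a fraud go through

    // Flag when p * costFN > (1 - p) * costFP, i.e. when p exceeds
    // costFP / (costFP + costFN), far below the naive 0.5 cutoff.
    double threshold = costFalsePositive / (costFalsePositive + costFalseNegative);

    double pFraud = 0.03;               // model's score for some transaction
    boolean flag = pFraud > threshold;
    System.out.printf("threshold = %.3f, p = %.2f, flag = %b%n",
        threshold, pFraud, flag);       // threshold ~= 0.010, so flag = true
  }
}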
On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <[email protected]>
wrote:
This is the class imbalance problem (i.e. you have many more instances
of one class than of another).

In this case, you could ensure that the training set is balanced
(50:50); more interestingly, you can use a prior which corrects for
this. Or you could over-sample or even under-sample the training set,
etc.
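
(A minimal sketch of the prior-correction idea: if the model was trained on
an artificially balanced set but deployed where the real base rate differs,
rescale its posterior by the ratio of the true prior to the training prior.
The numbers are illustrative.)

/** Toy sketch of correcting a classifier's output for a mismatch between the
 *  class prior it was trained with and the prior seen in deployment. */
public class PriorCorrectionSketch {

  public static void main(String[] args) {
    double trainPriorPos = 0.5;    // trained on an artificially balanced set
    double truePriorPos  = 0.01;   // real-world base rate of the positive class

    double pTrain = 0.8;           // posterior P(pos | x) from the balanced model

    // Bayes-rule rescaling: divide out the training prior, multiply in the
    // true prior, then renormalize over both classes.
    double posOdds = pTrain * (truePriorPos / trainPriorPos);
    double negOdds = (1 - pTrain) * ((1 - truePriorPos) / (1 - trainPriorPos));
    double pCorrected = posOdds / (posOdds + negOdds);

    System.out.printf("balanced-model P(pos) = %.2f, corrected P(pos) = %.3f%n",
        pTrain, pCorrected);       // corrected ~= 0.039
  }
}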
--
Ted Dunning, CTO
DeepDyve
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search