More info:
Classifying docs (same train/test set) as "Republicans" or "Democrats"
yields:
[java] Summary
[java] -------------------------------------------------------
     [java] Correctly Classified Instances   :   56    76.7123%
     [java] Incorrectly Classified Instances :   17    23.2877%
[java] Total Classified Instances : 73
[java]
[java] =======================================================
[java] Confusion Matrix
[java] -------------------------------------------------------
[java] a b <--Classified as
[java] 21 9 | 30 a = democrats
[java] 8 35 | 43 b = republicans
[java] Default Category: unknown: 2
[java]
[java]
For these, the training data was roughly equal in size (both about
1.5MB), and on the test I got about 81% right for the Republicans
(35/43 in the confusion matrix above) and 70% for the Democrats
(21/30). Does this imply Republicans do a better job of sticking to
message on Wikipedia than Democrats? :-) It would be interesting to
train on a larger set.
-Grant
On Jul 22, 2009, at 9:50 PM, Robin Anil wrote:
Did you try CBayes? It's supposed to negate the class imbalance effect
to some extent.
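
(For anyone curious what CBayes changes: below is a rough, self-contained
sketch of the complement-weighting idea it is based on, i.e. Rennie et al.'s
Complement Naive Bayes. The class names and counts are made up, and this is
only an illustration of the technique, not Mahout's actual code or API.)

/** Toy sketch of Complement Naive Bayes (the idea behind CBayes).
 *  Weights for class c are estimated from all classes EXCEPT c, so a
 *  class with few documents is not penalized by its own sparse counts. */
public class CBayesSketch {

  public static void main(String[] args) {
    // termCounts[class][term]: raw term counts per class (toy numbers).
    double[][] termCounts = {
        {40, 5, 1},   // class 0, e.g. "democrats"
        {90, 2, 30}   // class 1, e.g. "republicans" (the bigger class)
    };
    double alpha = 1.0;                       // Laplace smoothing
    int numClasses = termCounts.length;
    int numTerms = termCounts[0].length;

    // Complement weights: w[c][i] = log P(term i | NOT class c).
    double[][] w = new double[numClasses][numTerms];
    for (int c = 0; c < numClasses; c++) {
      double totalNotC = 0;
      for (int other = 0; other < numClasses; other++) {
        if (other == c) continue;
        for (int i = 0; i < numTerms; i++) totalNotC += termCounts[other][i];
      }
      for (int i = 0; i < numTerms; i++) {
        double countNotC = 0;
        for (int other = 0; other < numClasses; other++) {
          if (other == c) continue;
          countNotC += termCounts[other][i];
        }
        w[c][i] = Math.log((countNotC + alpha) / (totalNotC + alpha * numTerms));
      }
    }

    // Classify a document by picking the class whose COMPLEMENT fits worst,
    // i.e. the smallest sum of termFreq * complementWeight.
    double[] doc = {3, 1, 0};                 // term frequencies of a test doc
    int best = -1;
    double bestScore = Double.POSITIVE_INFINITY;
    for (int c = 0; c < numClasses; c++) {
      double score = 0;
      for (int i = 0; i < numTerms; i++) score += doc[i] * w[c][i];
      if (score < bestScore) { bestScore = score; best = c; }
    }
    System.out.println("predicted class: " + best);
  }
}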
On Thu, Jul 23, 2009 at 5:02 AM, Ted Dunning<[email protected]>
wrote:
Some learning algorithms deal with this better than others. The problem
is particularly bad in information retrieval (negative examples include
almost the entire corpus, positives are a tiny fraction) and in fraud
detection (less than 1% of the training data is typically fraud).

Down-sampling the over-represented case is the simplest answer where you
have lots of data. It doesn't help much to have more than 3x more data
for one case than another anyway (at least in binary decisions).
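
(A minimal sketch of what down-sampling with a ~3x cap might look like; the
helper name, the fraud/legit labels and the counts are just illustrative.)

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Toy sketch of down-sampling the over-represented class so that it is
 *  at most ~3x the size of the under-represented one. */
public class DownSampleSketch {

  public static <T> List<T> downSample(List<T> majority, int minoritySize,
                                       double maxRatio, Random rng) {
    int keep = (int) Math.min(majority.size(), maxRatio * minoritySize);
    List<T> copy = new ArrayList<>(majority);
    Collections.shuffle(copy, rng);          // take a random subset, not the first N
    return copy.subList(0, keep);
  }

  public static void main(String[] args) {
    List<String> fraud = new ArrayList<>();      // rare positives
    List<String> legit = new ArrayList<>();      // abundant negatives
    for (int i = 0; i < 100; i++) fraud.add("fraud-" + i);
    for (int i = 0; i < 10000; i++) legit.add("legit-" + i);

    List<String> sampledLegit = downSample(legit, fraud.size(), 3.0, new Random(42));
    System.out.println("kept " + sampledLegit.size() + " of " + legit.size()
        + " negatives");                         // kept 300 of 10000 negatives
  }
}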
Another aspect of this is the cost of different errors. For instance, in
fraud, verifying a transaction with a customer has a low (but non-zero)
cost, while not detecting a fraud in progress can be very, very bad.
False negatives are thus more of a problem than false positives, and the
models are tuned accordingly.
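
(A minimal sketch of one way that tuning can work: derive the decision
threshold from the relative costs of the two error types. The cost numbers
are made up.)

/** Toy sketch of cost-sensitive thresholding: flag a transaction whenever the
 *  expected cost of ignoring it exceeds the expected cost of verifying it. */
public class CostThresholdSketch {

  public static void main(String[] args) {
    double costFalsePositive = 5.0;     // bothering a customer with a call
    double costFalseNegative = 500.0;   // letting a fraud go through

    // Flag when p * costFN > (1 - p) * costFP, i.e. when p exceeds
    // costFP / (costFP + costFN), far below the naive 0.5 cutoff.
    double threshold = costFalsePositive / (costFalsePositive + costFalseNegative);

    double pFraud = 0.03;               // model's score for some transaction
    boolean flag = pFraud > threshold;
    System.out.printf("threshold = %.3f, p = %.2f, flag = %b%n",
        threshold, pFraud, flag);       // threshold ~= 0.010, so flag = true
  }
}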
On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <[email protected]>
wrote:
This is the class imbalance problem (i.e. you have many more instances
of one class than of another).

In this case, you could ensure that the training set is balanced
(50:50); more interestingly, you can use a prior which corrects for
this. Or you could over-sample or even under-sample the training set,
etc.
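
(A minimal sketch of the prior-correction idea: if the model was trained on
an artificially balanced set but deployed where the real base rate differs,
rescale its posterior by the ratio of the true prior to the training prior.
The numbers are illustrative.)

/** Toy sketch of correcting a classifier's output for a mismatch between the
 *  class prior it was trained with and the prior seen in deployment. */
public class PriorCorrectionSketch {

  public static void main(String[] args) {
    double trainPriorPos = 0.5;    // trained on an artificially balanced set
    double truePriorPos  = 0.01;   // real-world base rate of the positive class

    double pTrain = 0.8;           // posterior P(pos | x) from the balanced model

    // Bayes-rule rescaling: divide out the training prior, multiply in the
    // true prior, then renormalize over both classes.
    double posOdds = pTrain * (truePriorPos / trainPriorPos);
    double negOdds = (1 - pTrain) * ((1 - truePriorPos) / (1 - trainPriorPos));
    double pCorrected = posOdds / (posOdds + negOdds);

    System.out.printf("balanced-model P(pos) = %.2f, corrected P(pos) = %.3f%n",
        pTrain, pCorrected);       // corrected ~= 0.039
  }
}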
--
Ted Dunning, CTO
DeepDyve
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search