Gotta run, but will do tmr. I actually took my feature count down from ~500 to 10, and started getting much better results :) Even with a 10% hold-out set (held out from any training whatsoever).
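For reference, the hold-out is carved off before any training along these lines (a rough sketch with made-up names, nothing Mahout-specific):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Rough sketch (made-up names): carve off a 10% hold-out *before* any
// training so those examples never influence the model.
public class HoldoutSplit {
    public static void main(String[] args) {
        List<Integer> examples = new ArrayList<>();
        for (int i = 0; i < 1800; i++) {   // ~1800 vectors, as in the thread
            examples.add(i);
        }
        Collections.shuffle(examples, new Random(42)); // fixed seed for repeatability
        int cut = examples.size() / 10;                // 10% held out
        List<Integer> holdout = examples.subList(0, cut);
        List<Integer> training = examples.subList(cut, examples.size());
        System.out.println("holdout=" + holdout.size()
            + " training=" + training.size());         // holdout=180 training=1620
    }
}
```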
So it's looking better, but that stat is still just odd... (even now)..

Thanks!

Take care,
-stu

________________________________
From: Ted Dunning <[email protected]>
To: [email protected]; Stuart Smith <[email protected]>
Cc: Mahout List <[email protected]>
Sent: Monday, January 23, 2012 5:52 PM
Subject: Re: SGD: mismatch in percentCorrect vs classify() on training data?

Hmm... I am surprised as well. As I remember, percentCorrect *is* a weighted moving average, so I would expect some discrepancy, but not this much.

Can you post your training/test data somewhere? It would be good to test in synchrony.

On Mon, Jan 23, 2012 at 3:37 PM, Stuart Smith <[email protected]> wrote:
> Actually, to be clear, I looked through the CrossFoldLearner code, and
> understand how it gets calculated.. but I'm surprised that the discrepancy
> is so large..
>
> Take care,
> -stu
>
> ________________________________
> From: Stuart Smith <[email protected]>
> To: Mahout List <[email protected]>
> Sent: Monday, January 23, 2012 2:54 PM
> Subject: SGD: mismatch in percentCorrect vs classify() on training data?
>
> Hello,
>
> I just started experimenting with the SGD/Logistic Regression classifier.
> Right now I believe I have too little training data for the number of
> dimensions (~1800 vectors, roughly evenly split between two classes, ~500
> dimensions).
>
> However, I'm just trying to understand how to measure the efficacy of the
> classifier.
>
> I trained a classifier like so:
>
> - I have two categories, "good" and "bad"
>
> - ran AdaptiveLogisticRegression() over the training data 10 times (in the
>   same order)
>
> - got percentCorrect and AUC of the best classifier
>
> - took .getBest().getPayload().getLearner() and trained that over all the
>   training data again
>   (on the theory that ALR was only showing it a small slice of the data
>   it had; it seemed to help)
>
> - got percentCorrect() of the classifier.
> - ran classify() on the good/bad vectors of the training set, counting
>   FP/TP in each case.
>
> What I'm having trouble with is understanding a discrepancy between the
> results of the last two steps.
>
> .percentCorrect() returns ~90%
> M = number of training examples
> However, (TP_Good + TP_Bad) / M ~ 50%
> Interestingly enough, (TP_Good + FP_Bad) / M ~ 90%
>
> So I'm kind of confused about what .percentCorrect means... how is this
> counted?
>
> Below is a code snippet where I do the final training & counting, just in
> case I made some bonehead mistake:
>
> /** training best on all data... **/
> System.out.println("Training best on all data..");
> ARFFVectorIterable retrainGood =
>     new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
> Iterator<Vector> retrainGoodIter = retrainGood.iterator();
> while (retrainGoodIter.hasNext()) {
>     bestClassifier.train(goodLabel, retrainGoodIter.next());
> }
>
> ARFFVectorIterable retrainBad =
>     new ARFFVectorIterable(badArff, new MapBackedARFFModel());
> Iterator<Vector> retrainBadIter = retrainBad.iterator();
> while (retrainBadIter.hasNext()) {
>     bestClassifier.train(badLabel, retrainBadIter.next());
> }
> System.out.println("Best learner percent correct on all data: "
>     + bestClassifier.percentCorrect());
>
> ARFFVectorIterable fpVectors =
>     new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
> Iterator<Vector> fpIterator = fpVectors.iterator();
> int goodFpCount = 0;
> int goodTpCount = 0;
> int testCount = 0;
> while (fpIterator.hasNext()) {
>     Vector goodVector = fpIterator.next();
>     double probabilityGood =
>         1.0 - bestClassifier.classify(goodVector).get(badLabel);
>     testCount++;
>     if (probabilityGood > 0.0) {
>         if (probabilityGood <= 1.0) {
>             System.out.print(probabilityGood + ",");
>         }
>         goodTpCount++;
>     } else {
>         goodFpCount++;
>     }
> }
> System.out.println();
> System.out.println("FP count: " + goodFpCount);
> System.out.println("TP of good files: " + goodTpCount);
> ARFFVectorIterable tpVectors =
>     new ARFFVectorIterable(badArff, new MapBackedARFFModel());
> Iterator<Vector> tpIterator = tpVectors.iterator();
> int badTpCount = 0;
> int badFpCount = 0;
> while (tpIterator.hasNext()) {
>     Vector badVector = tpIterator.next();
>     double probabilityBad =
>         bestClassifier.classify(badVector).get(badLabel);
>     testCount++;
>     if (probabilityBad > 0.0) {
>         if (probabilityBad <= 1.0) {
>             System.out.print(probabilityBad + ",");
>         }
>         badTpCount++;
>     } else {
>         badFpCount++;
>     }
> }
> System.out.println();
> System.out.println("TP count: " + badTpCount);
> System.out.println("FP on bad clusters: " + badFpCount);
> System.out.println("Test count: " + testCount);
>
> Any help is appreciated!
>
> Take care,
> -stu
>
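To make the moving-average point concrete: a running, exponentially decayed "percent correct" is weighted toward recent examples, so a learner that was wrong early in training and right late in training ends with a high running score even though a full pass over the same data scores 50%. A self-contained sketch (plain Java, not Mahout's actual code; the decay constant is invented):

```java
// Sketch only: exponentially decayed running accuracy vs. plain accuracy
// over the same sequence of predictions.
public class MovingAverageDemo {
    public static void main(String[] args) {
        boolean[] correct = new boolean[100];
        // wrong on the first 50 examples, right on the last 50
        for (int i = 50; i < 100; i++) correct[i] = true;

        double avg = 0.0;     // running estimate
        double alpha = 0.05;  // decay constant (made up for illustration)
        int hits = 0;
        for (boolean c : correct) {
            avg += alpha * ((c ? 1.0 : 0.0) - avg); // exponential moving average
            if (c) hits++;
        }
        System.out.println("moving average: " + avg);        // ~0.92
        System.out.println("plain accuracy: " + hits / 100.0); // 0.5
    }
}
```

The moving average lands near 90% while the plain count sits at 50% — the same shape as the discrepancy described above.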
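One thing worth double-checking in the counting loops above: the test `probabilityGood > 0.0` counts essentially every example as a positive call, since the returned probabilities are rarely exactly zero; a decision count usually compares against 0.5 instead. Also, a known-bad example that the model calls good is a false negative (a miss), not a false positive. A small sketch of threshold-based counting (plain Java, made-up scores):

```java
// Sketch (made-up scores): tallying with an explicit 0.5 decision
// threshold. pBad holds hypothetical p(bad) outputs for examples whose
// true label is "bad".
public class ThresholdCount {
    public static void main(String[] args) {
        double[] pBad = {0.9, 0.8, 0.4, 0.95, 0.1};
        int truePositives = 0;   // predicted bad, actually bad
        int falseNegatives = 0;  // predicted good, actually bad (a miss)
        for (double p : pBad) {
            if (p > 0.5) {
                truePositives++;
            } else {
                falseNegatives++;
            }
        }
        System.out.println("TP=" + truePositives + " FN=" + falseNegatives);
        // comparing against 0.0 instead would call all five "bad"
    }
}
```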
