Actually, to be clear: I looked through the CrossFoldLearner code, and I
understand how it gets calculated, but I'm surprised that the discrepancy is
so large.

Take care,
  -stu



________________________________
 From: Stuart Smith <[email protected]>
To: Mahout List <[email protected]> 
Sent: Monday, January 23, 2012 2:54 PM
Subject: SGD: mismatch in percentCorrect vs classify() on training data?
 
Hello,

  I just started experimenting with the SGD/Logistic Regression classifier.
Right now I believe I have too little training data for the number of dimensions
(~1800 vectors, roughly evenly split between two classes, ~500 dimensions).

However, I'm just trying to understand how to measure the efficacy of the 
classifier.

I trained a classifier like so:

- I have two categories, "good" and "bad".

- Ran AdaptiveLogisticRegression() over the training data 10 times (in the same
order).

- Got the percentCorrect and AUC of the best classifier.

- Took .getBest().getPayload().getLearner() and trained it over all the training
data again (on the theory that the ALR had only shown it a small slice of the
data it had; this seemed to help).

- Got percentCorrect() of that classifier.

- Ran classify() on the good/bad vectors of the training set, counting FP/TP in
each case.
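In code, the training/selection part of those steps looks roughly like this (a
sketch against the Mahout SGD API as I understand it; the Example holder class
and the L1 prior are my assumptions, not necessarily what anyone else should
use):

```java
import java.util.List;

import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.CrossFoldLearner;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.math.Vector;

public class TrainSketch {
    // Hypothetical holder for a labeled training vector.
    static class Example {
        final int label;      // category index, e.g. 0 = good, 1 = bad
        final Vector vector;
        Example(int label, Vector vector) { this.label = label; this.vector = vector; }
    }

    static CrossFoldLearner trainBest(List<Example> examples, int numFeatures) {
        // Two categories; the L1 prior here is an assumption on my part.
        AdaptiveLogisticRegression alr =
            new AdaptiveLogisticRegression(2, numFeatures, new L1());
        for (int pass = 0; pass < 10; pass++) {   // 10 passes, same order each time
            for (Example e : examples) {
                alr.train(e.label, e.vector);
            }
        }
        alr.close();
        // Pull out the best cross-fold learner found during adaptation.
        CrossFoldLearner best = alr.getBest().getPayload().getLearner();
        System.out.println("percent correct: " + best.percentCorrect());
        System.out.println("AUC: " + best.auc());
        return best;
    }
}
```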

What I'm having trouble with is understanding a discrepancy between the results 
of the last two steps.

Let M = number of training examples.

.percentCorrect() returns ~90%,
however (TP_Good + TP_Bad) / M ~ 50%.
Interestingly enough, (TP_Good + FP_Bad) / M ~ 90%.


So I'm kind of confused about what .percentCorrect means... how is this counted?
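For reference, here is the accounting I expected percentCorrect to do, as plain
confusion-matrix arithmetic (the helper class and the counts below are made up,
chosen only to reproduce the 50%/90% pattern I'm seeing, not real results):

```java
// Hypothetical accounting -- what I expected percentCorrect to mean for two classes.
public class AccuracyCheck {
    // Accuracy over M examples: correctly classified good + correctly classified bad.
    static double accuracy(int tpGood, int tpBad, int m) {
        return (double) (tpGood + tpBad) / m;
    }

    public static void main(String[] args) {
        int m = 1800;                     // total training examples (~my set size)
        // Made-up counts that reproduce the pattern:
        int tpGood = 810, fpGood = 90;    // good vectors: classified good / classified bad
        int tpBad = 90, fpBad = 810;      // bad vectors: classified bad / classified good
        System.out.println("(TP_Good + TP_Bad) / M = " + accuracy(tpGood, tpBad, m));      // 0.5
        System.out.println("(TP_Good + FP_Bad) / M = " + (double) (tpGood + fpBad) / m);   // 0.9
    }
}
```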

Below is a code snippet where I do the final training & counting, just in case 
I made some bonehead mistake:

            /** training best on all data... **/
            System.out.println("Training best on all data..");
            ARFFVectorIterable retrainGood = new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
            Iterator<Vector> retrainGoodIter = retrainGood.iterator();
            while (retrainGoodIter.hasNext()) {
                bestClassifier.train(goodLabel, retrainGoodIter.next());
            }

            ARFFVectorIterable retrainBad = new ARFFVectorIterable(badArff, new MapBackedARFFModel());
            Iterator<Vector> retrainBadIter = retrainBad.iterator();
            while (retrainBadIter.hasNext()) {
                bestClassifier.train(badLabel, retrainBadIter.next());
            }
            System.out.println("Best learner percent correct on all data: " + bestClassifier.percentCorrect());

            // Re-read the "good" vectors and count how many classify as good (TP) vs. bad (FP).
            ARFFVectorIterable fpVectors = new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
            Iterator<Vector> fpIterator = fpVectors.iterator();
            int goodFpCount = 0;
            int goodTpCount = 0;
            int testCount = 0;
            while (fpIterator.hasNext()) {
                Vector goodVector = fpIterator.next();
                double probabilityGood = 1.0 - bestClassifier.classify(goodVector).get(badLabel);
                testCount++;
                if (probabilityGood > 0.0) {
                    if (probabilityGood <= 1.0) {
                        System.out.print(probabilityGood + ",");
                    }
                    goodTpCount++;
                } else {
                    goodFpCount++;
                }
            }
            System.out.println();
            System.out.println("FP count: " + goodFpCount);
            System.out.println("TP of good files: " + goodTpCount);

            // Same counting over the "bad" vectors.
            ARFFVectorIterable tpVectors = new ARFFVectorIterable(badArff, new MapBackedARFFModel());
            Iterator<Vector> tpIterator = tpVectors.iterator();
            int badTpCount = 0;
            int badFpCount = 0;
            while (tpIterator.hasNext()) {
                Vector badVector = tpIterator.next();
                double probabilityBad = bestClassifier.classify(badVector).get(badLabel);
                testCount++;
                if (probabilityBad > 0.0) {
                    if (probabilityBad <= 1.0) {
                        System.out.print(probabilityBad + ",");
                    }
                    badTpCount++;
                } else {
                    badFpCount++;
                }
            }
            System.out.println();
            System.out.println("TP count: " + badTpCount);
            System.out.println("FP on bad clusters: " + badFpCount);
            System.out.println("Test count: " + testCount);

Any help is appreciated! 


Take care,
  -stu
