Actually, to be clear: I looked through the CrossFoldLearner code and understand how percentCorrect gets calculated, but I'm surprised that the discrepancy is so large.
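In case it helps pin down what I mean: my rough mental model is that percentCorrect is a decayed running estimate updated as training examples stream by, not a batch count over a fixed set. The sketch below is my own self-contained simplification (the decay constant and update rule are made up for illustration; this is not Mahout's actual code), just to show why a running estimate over a stream can disagree with a batch count over the same examples:

```java
// Toy sketch (my own simplification, NOT Mahout's CrossFoldLearner): an
// exponentially decayed running accuracy, updated once per example.
public class RunningAccuracySketch {
    private double estimate = 0.0;
    private final double decay; // weight given to history, e.g. 0.99 (made-up value)

    public RunningAccuracySketch(double decay) {
        this.decay = decay;
    }

    public void update(boolean correct) {
        // Blend the new observation in; older observations fade geometrically.
        estimate = decay * estimate + (1.0 - decay) * (correct ? 1.0 : 0.0);
    }

    public double percentCorrect() {
        return estimate;
    }

    public static void main(String[] args) {
        RunningAccuracySketch acc = new RunningAccuracySketch(0.99);
        // A stream with 1000 early mistakes followed by 1000 recent hits:
        for (int i = 0; i < 1000; i++) acc.update(false);
        for (int i = 0; i < 1000; i++) acc.update(true);
        // The running estimate is dominated by recent examples (close to 1.0),
        // while a batch count over the same 2000 examples would say 0.5.
        System.out.println(acc.percentCorrect());
    }
}
```

That asymmetry between "recent" and "overall" is the kind of gap I'd expect a running estimate to open up, just not one this big.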
Take care,
-stu

________________________________
From: Stuart Smith <[email protected]>
To: Mahout List <[email protected]>
Sent: Monday, January 23, 2012 2:54 PM
Subject: SGD: mismatch in percentCorrect vs classify() on training data?

Hello,

I just started experimenting with the SGD/logistic regression classifier. Right now I believe I have too little training data for the number of dimensions (~1800 vectors, roughly evenly split between two classes, ~500 dimensions). However, I'm just trying to understand how to measure the efficacy of the classifier.

I trained a classifier like so:
- I have two categories, "good" and "bad".
- I ran AdaptiveLogisticRegression() over the training data 10 times (in the same order).
- I got the percentCorrect and AUC of the best classifier.
- I took .getBest().getPayload().getLearner() and trained that learner over all the training data again (on the theory that ALR had only shown it a small slice of the data it had; this seemed to help).
- I got percentCorrect() from that classifier.
- I ran classify() on the good/bad vectors of the training set, counting FP/TP in each case.

What I'm having trouble with is understanding the discrepancy between the results of the last two steps. With M = number of training examples:

    .percentCorrect() returns ~90%

however

    (TP_Good + TP_Bad) / M ~ 50%

Interestingly enough,

    (TP_Good + FP_Bad) / M ~ 90%

So I'm kind of confused about what .percentCorrect() means... how is this counted?

Below is a code snippet where I do the final training & counting, just in case I made some bonehead mistake:

    /** Training best on all data... **/
    System.out.println("Training best on all data...");

    ARFFVectorIterable retrainGood = new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
    Iterator<Vector> retrainGoodIter = retrainGood.iterator();
    while (retrainGoodIter.hasNext()) {
        bestClassifier.train(goodLabel, retrainGoodIter.next());
    }

    ARFFVectorIterable retrainBad = new ARFFVectorIterable(badArff, new MapBackedARFFModel());
    Iterator<Vector> retrainBadIter = retrainBad.iterator();
    while (retrainBadIter.hasNext()) {
        bestClassifier.train(badLabel, retrainBadIter.next());
    }

    System.out.println("Best learner percent correct on all data: " + bestClassifier.percentCorrect());

    // Count TP/FP over the "good" vectors.
    ARFFVectorIterable fpVectors = new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
    Iterator<Vector> fpIterator = fpVectors.iterator();
    int goodFpCount = 0;
    int goodTpCount = 0;
    int testCount = 0;
    while (fpIterator.hasNext()) {
        Vector goodVector = fpIterator.next();
        double probabilityGood = 1.0 - bestClassifier.classify(goodVector).get(badLabel);
        testCount++;
        if (probabilityGood > 0.0) {
            if (probabilityGood <= 1.0) {
                System.out.print(probabilityGood + ",");
            }
            goodTpCount++;
        } else {
            goodFpCount++;
        }
    }
    System.out.println();
    System.out.println("FP count: " + goodFpCount);
    System.out.println("TP of good files: " + goodTpCount);

    // Count TP/FP over the "bad" vectors.
    ARFFVectorIterable tpVectors = new ARFFVectorIterable(badArff, new MapBackedARFFModel());
    Iterator<Vector> tpIterator = tpVectors.iterator();
    int badTpCount = 0;
    int badFpCount = 0;
    while (tpIterator.hasNext()) {
        Vector badVector = tpIterator.next();
        double probabilityBad = bestClassifier.classify(badVector).get(badLabel);
        testCount++;
        if (probabilityBad > 0.0) {
            if (probabilityBad <= 1.0) {
                System.out.print(probabilityBad + ",");
            }
            badTpCount++;
        } else {
            badFpCount++;
        }
    }
    System.out.println();
    System.out.println("TP count: " + badTpCount);
    System.out.println("FP on bad clusters: " + badFpCount);
    System.out.println("Test count: " + testCount);

Any help is appreciated!

Take care,
-stu
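P.S. To make the counting scheme concrete without the ARFF plumbing, here is a stripped-down, self-contained toy version of the tally loop. The score() method and the score values are made up stand-ins for bestClassifier.classify(v).get(badLabel); the point is just how the cutoff in the if-test drives the tally:

```java
// Toy version of the tally loop above; no Mahout dependency.
public class TallySketch {
    // Made-up stand-in for bestClassifier.classify(v).get(badLabel):
    // simply echoes the value it is handed.
    static double score(double p) {
        return p;
    }

    public static void main(String[] args) {
        // Fabricated probability-of-bad scores for four "bad" examples.
        double[] probs = { 0.9, 0.6, 0.2, 0.1 };
        int tpLoose = 0;
        int tpHalf = 0;
        for (double p : probs) {
            if (score(p) > 0.0) tpLoose++;  // the cutoff used in the snippet above
            if (score(p) > 0.5) tpHalf++;   // a 0.5 cutoff, for comparison
        }
        System.out.println("TP with p > 0.0: " + tpLoose); // 4
        System.out.println("TP with p > 0.5: " + tpHalf);  // 2
    }
}
```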
