Gotta run, but will do tmr. I actually took my feature count down from ~500 to 10, and started getting much better results :) Even with a 10% hold-out set (held out from any training whatsoever).
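For reference, the hold-out is carved off before any training along these lines (a rough sketch with made-up names, nothing Mahout-specific):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Rough sketch (made-up names): carve off a 10% hold-out *before* any
// training so those examples never influence the model.
public class HoldoutSplit {
    public static void main(String[] args) {
        List<Integer> examples = new ArrayList<>();
        for (int i = 0; i < 1800; i++) {   // ~1800 vectors, as in the thread
            examples.add(i);
        }
        Collections.shuffle(examples, new Random(42)); // fixed seed for repeatability
        int cut = examples.size() / 10;                // 10% held out
        List<Integer> holdout = examples.subList(0, cut);
        List<Integer> training = examples.subList(cut, examples.size());
        System.out.println("holdout=" + holdout.size()
            + " training=" + training.size());         // holdout=180 training=1620
    }
}
```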
So it's looking better, but that stat is still just odd... (even now)..

Thanks!

Take care,
-stu

________________________________
From: Ted Dunning <[email protected]>
To: [email protected]; Stuart Smith <[email protected]>
Cc: Mahout List <[email protected]>
Sent: Monday, January 23, 2012 5:52 PM
Subject: Re: SGD: mismatch in percentCorrect vs classify() on training data?

Hmm... I am surprised as well. As I remember, percentCorrect *is* a weighted moving average, so I would expect some discrepancy, but not this much.

Can you post your training/test data somewhere? It would be good to test in synchrony.

On Mon, Jan 23, 2012 at 3:37 PM, Stuart Smith <[email protected]> wrote:
> Actually, to be clear, I looked through the CrossFoldLearner code, and
> understand how it gets calculated.. but I'm surprised that the discrepancy
> is so large..
>
> Take care,
> -stu
>
> ________________________________
> From: Stuart Smith <[email protected]>
> To: Mahout List <[email protected]>
> Sent: Monday, January 23, 2012 2:54 PM
> Subject: SGD: mismatch in percentCorrect vs classify() on training data?
>
> Hello,
>
> I just started experimenting with the SGD/Logistic Regression classifier.
> Right now I believe I have too little training data for the number of
> dimensions (~1800 vectors, roughly evenly split between two classes, ~500
> dimensions).
>
> However, I'm just trying to understand how to measure the efficacy of the
> classifier.
>
> I trained a classifier like so:
>
> - I have two categories, "good" and "bad"
>
> - ran AdaptiveLogisticRegression() over the training data 10 times (in the
>   same order)
>
> - got percentCorrect and AUC of the best classifier
>
> - took .getBest().getPayload().getLearner() and trained that over all the
>   training data again
>   (on the theory that ALR was only showing it a small slice of the data
>   it had; it seemed to help)
>
> - got percentCorrect() of the classifier.
> - ran classify() on the good/bad vectors of the training set, counting
>   FP/TP in each case.
>
> What I'm having trouble with is understanding a discrepancy between the
> results of the last two steps.
>
> .percentCorrect() returns ~90%
> M = number of training examples
> However, (TP_Good + TP_Bad) / M ~ 50%
> Interestingly enough, (TP_Good + FP_Bad) / M ~ 90%
>
> So I'm kind of confused about what .percentCorrect means... how is this
> counted?
>
> Below is a code snippet where I do the final training & counting, just in
> case I made some bonehead mistake:
>
> /** training best on all data... **/
> System.out.println("Training best on all data..");
> ARFFVectorIterable retrainGood =
>     new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
> Iterator<Vector> retrainGoodIter = retrainGood.iterator();
> while (retrainGoodIter.hasNext()) {
>     bestClassifier.train(goodLabel, retrainGoodIter.next());
> }
>
> ARFFVectorIterable retrainBad =
>     new ARFFVectorIterable(badArff, new MapBackedARFFModel());
> Iterator<Vector> retrainBadIter = retrainBad.iterator();
> while (retrainBadIter.hasNext()) {
>     bestClassifier.train(badLabel, retrainBadIter.next());
> }
> System.out.println("Best learner percent correct on all data: "
>     + bestClassifier.percentCorrect());
>
> ARFFVectorIterable fpVectors =
>     new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
> Iterator<Vector> fpIterator = fpVectors.iterator();
> int goodFpCount = 0;
> int goodTpCount = 0;
> int testCount = 0;
> while (fpIterator.hasNext()) {
>     Vector goodVector = fpIterator.next();
>     double probabilityGood =
>         1.0 - bestClassifier.classify(goodVector).get(badLabel);
>     testCount++;
>     if (probabilityGood > 0.0) {
>         if (probabilityGood <= 1.0) {
>             System.out.print(probabilityGood + ",");
>         }
>         goodTpCount++;
>     } else {
>         goodFpCount++;
>     }
> }
> System.out.println();
> System.out.println("FP count: " + goodFpCount);
> System.out.println("TP of good files: " + goodTpCount);
> ARFFVectorIterable tpVectors =
>     new ARFFVectorIterable(badArff, new MapBackedARFFModel());
> Iterator<Vector> tpIterator = tpVectors.iterator();
> int badTpCount = 0;
> int badFpCount = 0;
> while (tpIterator.hasNext()) {
>     Vector badVector = tpIterator.next();
>     double probabilityBad =
>         bestClassifier.classify(badVector).get(badLabel);
>     testCount++;
>     if (probabilityBad > 0.0) {
>         if (probabilityBad <= 1.0) {
>             System.out.print(probabilityBad + ",");
>         }
>         badTpCount++;
>     } else {
>         badFpCount++;
>     }
> }
> System.out.println();
> System.out.println("TP count: " + badTpCount);
> System.out.println("FP on bad clusters: " + badFpCount);
> System.out.println("Test count: " + testCount);
>
> Any help is appreciated!
>
> Take care,
> -stu
>
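To make the moving-average point concrete: a running, exponentially decayed "percent correct" is weighted toward recent examples, so a learner that was wrong early in training and right late in training ends with a high running score even though a full pass over the same data scores 50%. A self-contained sketch (plain Java, not Mahout's actual code; the decay constant is invented):

```java
// Sketch only: exponentially decayed running accuracy vs. plain accuracy
// over the same sequence of predictions.
public class MovingAverageDemo {
    public static void main(String[] args) {
        boolean[] correct = new boolean[100];
        // wrong on the first 50 examples, right on the last 50
        for (int i = 50; i < 100; i++) correct[i] = true;

        double avg = 0.0;     // running estimate
        double alpha = 0.05;  // decay constant (made up for illustration)
        int hits = 0;
        for (boolean c : correct) {
            avg += alpha * ((c ? 1.0 : 0.0) - avg); // exponential moving average
            if (c) hits++;
        }
        System.out.println("moving average: " + avg);        // ~0.92
        System.out.println("plain accuracy: " + hits / 100.0); // 0.5
    }
}
```

The moving average lands near 90% while the plain count sits at 50% — the same shape as the discrepancy described above.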
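One thing worth double-checking in the counting loops above: the test `probabilityGood > 0.0` counts essentially every example as a positive call, since the returned probabilities are rarely exactly zero; a decision count usually compares against 0.5 instead. Also, a known-bad example that the model calls good is a false negative (a miss), not a false positive. A small sketch of threshold-based counting (plain Java, made-up scores):

```java
// Sketch (made-up scores): tallying with an explicit 0.5 decision
// threshold. pBad holds hypothetical p(bad) outputs for examples whose
// true label is "bad".
public class ThresholdCount {
    public static void main(String[] args) {
        double[] pBad = {0.9, 0.8, 0.4, 0.95, 0.1};
        int truePositives = 0;   // predicted bad, actually bad
        int falseNegatives = 0;  // predicted good, actually bad (a miss)
        for (double p : pBad) {
            if (p > 0.5) {
                truePositives++;
            } else {
                falseNegatives++;
            }
        }
        System.out.println("TP=" + truePositives + " FN=" + falseNegatives);
        // comparing against 0.0 instead would call all five "bad"
    }
}
```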
