Hello,
I just started experimenting with the SGD/Logistic Regression classifier.
Right now I believe I have too little training data for the number of dimensions
(~1800 vectors, roughly evenly split between two classes, ~500 dimensions).
However, I'm just trying to understand how to measure the efficacy of the
classifier.
I trained a classifier like so:
- I have two categories, "good" and "bad".
- Ran AdaptiveLogisticRegression() over the training data 10 times (in the same
order).
- Got percentCorrect and AUC of the best classifier.
- Took .getBest().getPayload().getLearner() and trained it over all the training
data again (on the theory that ALR was only showing it a small slice of the
data it had; this seemed to help).
- Got percentCorrect() of that classifier.
- Ran classify() on the good/bad vectors of the training set, counting FP/TP in
each case.
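In code, that procedure is roughly the following (a sketch from memory against
the Mahout SGD API; `cardinality` and `trainingExamples` are stand-ins for my
actual feature count and label/vector pairs, not real code):

```java
// Sketch of the training procedure described above (Mahout SGD API).
// `cardinality` and `trainingExamples` are placeholders for my setup.
AdaptiveLogisticRegression alr =
    new AdaptiveLogisticRegression(2, cardinality, new L1());
for (int pass = 0; pass < 10; pass++) {
    // same order on every pass
    for (Pair<Integer, Vector> example : trainingExamples) {
        alr.train(example.getFirst(), example.getSecond());
    }
}
// Pull out the best learner and its running statistics.
CrossFoldLearner best = alr.getBest().getPayload().getLearner();
System.out.println("percent correct: " + best.percentCorrect());
System.out.println("AUC: " + best.auc());
```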
What I'm having trouble with is understanding a discrepancy between the results
of the last two steps. With M = number of training examples:

.percentCorrect() returns ~90%
however (TP_Good + TP_Bad) / M ~ 50%
and, interestingly enough, (TP_Good + FP_Bad) / M ~ 90%

So I'm kind of confused about what .percentCorrect means... how is this counted?
Below is a code snippet where I do the final training & counting, just in case
I made some bonehead mistake:
// Train the best learner on all data
System.out.println("Training best on all data..");
ARFFVectorIterable retrainGood =
    new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
for (Vector goodVector : retrainGood) {
    bestClassifier.train(goodLabel, goodVector);
}
ARFFVectorIterable retrainBad =
    new ARFFVectorIterable(badArff, new MapBackedARFFModel());
for (Vector badVector : retrainBad) {
    bestClassifier.train(badLabel, badVector);
}
System.out.println("Best learner percent correct on all data: "
    + bestClassifier.percentCorrect());

// Count TP/FP over the "good" vectors
ARFFVectorIterable fpVectors =
    new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
int goodFpCount = 0;
int goodTpCount = 0;
int testCount = 0;
for (Vector goodVector : fpVectors) {
    double probabilityGood =
        1.0 - bestClassifier.classify(goodVector).get(badLabel);
    testCount++;
    if (probabilityGood > 0.0) {
        if (probabilityGood <= 1.0) {
            System.out.print(probabilityGood + ",");
        }
        goodTpCount++;
    } else {
        goodFpCount++;
    }
}
System.out.println();
System.out.println("FP count: " + goodFpCount);
System.out.println("TP of good files: " + goodTpCount);

// Count TP/FP over the "bad" vectors
ARFFVectorIterable tpVectors =
    new ARFFVectorIterable(badArff, new MapBackedARFFModel());
int badTpCount = 0;
int badFpCount = 0;
for (Vector badVector : tpVectors) {
    double probabilityBad = bestClassifier.classify(badVector).get(badLabel);
    testCount++;
    if (probabilityBad > 0.0) {
        if (probabilityBad <= 1.0) {
            System.out.print(probabilityBad + ",");
        }
        badTpCount++;
    } else {
        badFpCount++;
    }
}
System.out.println();
System.out.println("TP count: " + badTpCount);
System.out.println("FP on bad clusters: " + badFpCount);
System.out.println("Test count: " + testCount);
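To sanity-check the counting logic itself, independent of Mahout, here is a toy
version of the same thresholding loop on a few made-up probabilities (purely
hypothetical numbers, just to exercise the TP bookkeeping; the 0.0 cutoff is
the one my snippet above uses):

```java
// Toy version of the TP counting loop above, on made-up scores.
// countAbove() tallies how many scores clear the cutoff, i.e. the
// "TP" counter in my loops.
public class ThresholdCheck {
    static int countAbove(double[] scores, double cutoff) {
        int n = 0;
        for (double s : scores) {
            if (s > cutoff) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        double[] probabilityGood = {0.9, 0.7, 0.2, 0.6}; // hypothetical scores
        System.out.println("TP at cutoff 0.0: "
            + countAbove(probabilityGood, 0.0)); // prints 4
        System.out.println("TP at cutoff 0.5: "
            + countAbove(probabilityGood, 0.5)); // prints 3
    }
}
```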
Any help is appreciated!
Take care,
-stu