>> We've been playing around with a number of different parameters, feature
>> selection, etc. and are able to achieve pretty good results in
>> cross-validation.
>>
>>When you say cross validation, do you mean the magic cross validation that
>>the ALR uses?  Or do you mean your 20%?

I mean the 20%.  Does the ALR algorithm do its own cross validation?  I was 
under the impression that it did training and testing steps internally, with a 
percentage split based on the number of learners (CrossFoldLearners?) held in 
the object.  Is that correct?  As I said, we've been holding back 20% to do our 
own cross validation.
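
For reference, my mental model of the two layers looks something like the 
sketch below -- simplified Java against the 0.x org.apache.mahout.classifier.sgd 
API, with the list/label variables made up:

import java.util.List;
import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.CrossFoldLearner;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.math.Vector;

public class TwoLevelValidation {
  // trainVecs/trainLabels are our 80%; holdVecs/holdLabels are the 20% we keep back.
  public static void check(List<Vector> trainVecs, List<Integer> trainLabels,
                           List<Vector> holdVecs, List<Integer> holdLabels,
                           int numFeatures) {
    // Each candidate model inside ALR is a CrossFoldLearner, which splits the
    // training stream into folds internally -- that is the "magic" cross validation.
    AdaptiveLogisticRegression alr =
        new AdaptiveLogisticRegression(2, numFeatures, new L1());
    for (int i = 0; i < trainVecs.size(); i++) {
      alr.train(trainLabels.get(i), trainVecs.get(i));
    }
    alr.close();

    CrossFoldLearner best = alr.getBest().getPayload().getLearner();
    System.out.printf("internal CV AUC = %.3f%n", best.auc());

    // Our own 20% holdout, scored against the same best learner.
    int hits = 0;
    for (int i = 0; i < holdVecs.size(); i++) {
      double p1 = best.classifyFull(holdVecs.get(i)).get(1);  // P(class == 1)
      if ((p1 >= 0.5 ? 1 : 0) == holdLabels.get(i)) {
        hits++;
      }
    }
    System.out.printf("holdout accuracy = %.3f%n", (double) hits / holdVecs.size());
  }
}

If I understand it right, the internal AUC comes from folds inside the training 
stream, while our 20% is scored entirely outside of it.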

>>  We have a ton of different metrics we're tracking on the results, most
>> significant to this discussion is that it looks like we're achieving very
>> good precision (typically >.85 or .9) and a good f1-score (typically again
>> >.85 or .9).
>>
>>These are extremely good results.  In fact they are good enough that I would
>>start thinking about a target leak.

The possibility of a target leak is interesting, as it hadn't occurred to me 
previously.  However, thinking it through, I'm less inclined to believe it's a 
possibility.  We wrote a simple program to extract the model features and 
weights, and I would think a leak would be obvious there, yes?  The terms we're 
seeing seem to make sense.
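
The weight-dump program is along these lines (sketch only; the terms list maps 
column index to term and is rebuilt from the lucene.vector dictionary, with the 
loading omitted).  A frank leak ought to show up as one or two terms with 
absurdly large weights that trivially name the class:

import java.util.Arrays;
import java.util.List;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Matrix;

public class WeightDump {
  public static void topTerms(OnlineLogisticRegression model, List<String> terms, int n) {
    Matrix beta = model.getBeta();            // 1 x numFeatures for a binary model
    Integer[] idx = new Integer[terms.size()];
    for (int i = 0; i < idx.length; i++) {
      idx[i] = i;
    }
    // Sort feature indexes by descending absolute weight.
    Arrays.sort(idx, (i, j) -> Double.compare(Math.abs(beta.get(0, j)),
                                              Math.abs(beta.get(0, i))));
    for (int k = 0; k < n && k < idx.length; k++) {
      System.out.printf("%+.4f  %s%n", beta.get(0, idx[k]), terms.get(idx[k]));
    }
  }
}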

>>However, when we then take the models generated and try to apply them to
>> some new documents, we're getting many more false positives than we would
>> expect.  Documents that should have 2 categories are testing positive for
>> 16, which is well above what I'd expect.  By my math I should expect 2 true
>> positives, plus maybe 4.4 (.10 false positives * 44 classes) additional
>> false positives.
>>
>>
>>You said documents.  Where do these documents come from?

Sorry, to clarify: all of our inputs are documents.  Specifically, they're 
technical (scientific) papers written by people at our company.  The documents 
are indexed in SOLR, and we use the Mahout lucene.vector utility to extract our 
data.  We started developing this process a couple of months ago and took an 
extract from SOLR at that time.  The new documents we're trying to classify 
after settling on a model are the ones that have come into SOLR since that 
extraction took place.
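
For completeness, the extract is just the SequenceFile of vectors that 
lucene.vector writes out, read back with something like this (Mahout's 
SequenceFileIterable; the path is made up, and since the key type depends on 
the idField I've left it as a plain Writable):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class ReadExtract {
  public static void main(String[] args) {
    // Hypothetical path to the vector extract taken from the SOLR/Lucene index.
    Path vectors = new Path("solr-extract/part-out.vec");
    for (Pair<Writable, VectorWritable> rec :
         new SequenceFileIterable<Writable, VectorWritable>(vectors, true, new Configuration())) {
      Vector v = rec.getSecond().get();
      System.out.println(rec.getFirst() + " -> " + v.getNumNondefaultElements() + " terms");
    }
  }
}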

>>One way to get results just like you describe is if you train on raw news
>>wire that is split randomly between training and test.  What can happen is
>>that stories that get edited and republished have a high chance of getting
>>at least one version in both training and test.  This means that the
>>supposedly independent test set actually has significant overlap with the
>>training set.  If your classifier over-fits, then the test set doesn't
>>catch the problem.

I don't believe this is happening, but it is worth checking into.  
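
The brute-force check I have in mind is just cosine similarity between every 
held-out vector and every training vector, flagging anything suspiciously close 
(the 0.95 threshold is a guess):

import java.util.List;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Vector;

public class DuplicateCheck {
  // O(train x test) brute force -- fine at our scale.  Near-identical documents
  // on both sides of the split would explain inflated cross-validation numbers.
  public static void flagNearDuplicates(List<Vector> train, List<Vector> test) {
    CosineDistanceMeasure cosine = new CosineDistanceMeasure();
    for (int i = 0; i < test.size(); i++) {
      for (int j = 0; j < train.size(); j++) {
        double similarity = 1.0 - cosine.distance(test.get(i), train.get(j));
        if (similarity > 0.95) {
          System.out.printf("test %d ~ train %d (cos=%.3f)%n", i, j, similarity);
        }
      }
    }
  }
}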

>>Another way to get this sort of problem is if you do your training/test
>>randomly, but the new documents come from a later time.  If your classifier
>>is a good classifier, but is highly specific to documents from a particular
>>moment in time, then your test performance will be a realistic estimate of
>>performance for contemporaneous documents but will be much higher than
>>performance on documents from a later point in time.

The temporal aspect is an interesting one.  I will have to check on that.
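
If it turns out to matter, the fix seems straightforward: split by date instead 
of randomly, so the held-back 20% looks like the later documents we actually 
see.  Roughly (the Doc holder and the 80% cut point are made up):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class Doc {
  long timestampMillis;   // e.g. the SOLR date field
  int label;
  org.apache.mahout.math.Vector vector;
}

public class TemporalSplit {
  // Returns [train, holdout]: oldest 80% to train on, newest 20% to hold back.
  public static List<List<Doc>> split(List<Doc> docs) {
    docs.sort(Comparator.comparingLong((Doc d) -> d.timestampMillis));   // sorts in place
    int cut = (int) (docs.size() * 0.8);
    List<Doc> train = new ArrayList<Doc>(docs.subList(0, cut));
    List<Doc> holdout = new ArrayList<Doc>(docs.subList(cut, docs.size()));
    return Arrays.asList(train, holdout);
  }
}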

>>A third option could happen if your training and test sets were somehow
>>scrubbed of poorly structured and invalid documents.  This often happens.
>>Then, in the real system, if the scrubbing is not done, the classifier may
>>fail because the new documents are not scrubbed in the same way as the
>>training documents.

I think we've handled this.  I'm processing "new" documents programmatically 
through an analysis chain that I believe accurately mimics the one I indexed 
against in SOLR.  The results were complete garbage before I made them match 
exactly.  In addition, wouldn't I expect more false negatives than false 
positives if that were the case?
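
Concretely, the re-encoding step is along these lines ("body" is a stand-in 
field name, and I'm only counting raw term frequencies here -- the real code has 
to apply the same TF-IDF weighting and norm as the original extract):

import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class NewDocEncoder {
  // analyzer must be the same chain the SOLR field used at index time;
  // dictionary maps term -> column index from the original lucene.vector run.
  public static Vector encode(Analyzer analyzer, Map<String, Integer> dictionary,
                              String text, int numFeatures) throws IOException {
    Vector v = new RandomAccessSparseVector(numFeatures);
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      Integer col = dictionary.get(term.toString());
      if (col != null) {
        v.set(col, v.get(col) + 1);   // raw term count; out-of-dictionary terms are dropped
      }
    }
    ts.end();
    ts.close();
    return v;
  }
}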

>>Well, I think that, almost by definition, you have an overfitting problem
>>of some kind.  The question is what kind.  The only thing that I think
>>you don't have is a frank target leak in your documents.  That would
>>(probably) have given you even higher scores on your test case.

Is there any easy way to detect an overfit?  We've noticed at least one 
interesting thing that seems to be typical of the bad models.  For each class a 
percentage "confidence" score is reported.  With our binary models the choices 
are obviously 0 or 1.  The bad models tend to be very certain in their answers 
-- e.g. they're either >99% certain a document is or isn't a particular class.  
Is that indicative of overfitting, or completely unrelated?
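
One check I can run either way is a simple calibration table on the 20% 
holdout: bucket the predicted probabilities and compare them with the observed 
positive rate in each bucket (sketch; the model is whatever getBest() handed 
back):

import java.util.List;
import org.apache.mahout.classifier.sgd.CrossFoldLearner;
import org.apache.mahout.math.Vector;

public class CalibrationCheck {
  public static void report(CrossFoldLearner model, List<Vector> vecs, List<Integer> labels) {
    int buckets = 10;
    double[] sumPredicted = new double[buckets];
    double[] sumObserved = new double[buckets];
    int[] count = new int[buckets];
    for (int i = 0; i < vecs.size(); i++) {
      double p = model.classifyFull(vecs.get(i)).get(1);   // P(class == 1)
      int b = Math.min(buckets - 1, (int) (p * buckets));
      sumPredicted[b] += p;
      sumObserved[b] += labels.get(i);
      count[b]++;
    }
    for (int b = 0; b < buckets; b++) {
      if (count[b] > 0) {
        System.out.printf("bucket %d: n=%d  mean predicted=%.2f  observed=%.2f%n",
            b, count[b], sumPredicted[b] / count[b], sumObserved[b] / count[b]);
      }
    }
  }
}

If the >99% buckets hold most of the documents but the observed rate there is 
nowhere near 0 or 1, that would at least confirm the models are overconfident 
on held-out data.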

THANKS!
Ian
