Inline

On Mon, Dec 2, 2013 at 8:55 AM, optimusfan <optimus...@yahoo.com> wrote:

> ... To accomplish this, we used AdaptiveLogisticRegression and trained 46
> binary classification models.  Our approach has been to do an 80/20 split
> on the data, holding the 20% back for cross-validation of the models we
> generate.
>

Sounds reasonable.
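
For concreteness, here is roughly what I picture for that holdout, as a plain
Java sketch (Document and the fixed seed are stand-ins, not your actual
pipeline, which obviously does a lot more before the 46 models see anything):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Stand-in for whatever you actually feed the 46 binary models.
record Document(String id, String text) {}

class RandomHoldout {
  // Plain random 80/20 holdout: shuffle once, cut at 80%.
  static List<List<Document>> split(List<Document> docs, long seed) {
    List<Document> shuffled = new ArrayList<>(docs);
    Collections.shuffle(shuffled, new Random(seed));
    int cut = (int) (shuffled.size() * 0.8);
    return List.of(shuffled.subList(0, cut), shuffled.subList(cut, shuffled.size()));
  }
}

The reason I spell it out is that everything I say further down hinges on
exactly how that split is done.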


> We've been playing around with a number of different parameters, feature
> selection, etc. and are able to achieve pretty good results in
> cross-validation.


When you say cross validation, do you mean the magic cross validation that
the ALR uses?  Or do you mean your 20%?


>  We have a ton of different metrics we're tracking on the results, most
> significant to this discussion is that it looks like we're achieving very
> good precision (typically >.85 or .9) and a good f1-score (typically again
> >.85 or .9).


These are extremely good results.  In fact, they are good enough that I would
start thinking about a target leak.

>  However, when we then take the models generated and try to apply them to
> some new documents, we're getting many more false positives than we would
> expect.  Documents that should have 2 categories are testing positive for
> 16, which is well above what I'd expect.  By my math I should expect 2 true
> positives, plus maybe 4.4 (.10 false positives * 44 classes) additional
> false positives.
>

You said documents.  Where do these documents come from?

One way to get results just like you describe is if you train on raw news
wire that is split randomly between training and test.  What can happen is
that stories that get edited and republished have a high chance of getting
at least one version in both training and test.  This means that the
supposedly independent test set actually has significant overlap with the
training set.  If your classifier over-fits, then the test set doesn't
catch the problem.
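
One simple guard against that is to split by story rather than by document, so
every edited version of a story lands on the same side.  Here is a minimal
sketch of that idea (the normalization key is deliberately crude; heavily
edited republications usually need fuzzier matching such as shingling, but the
structure of the split is the point):

import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in document type.
record Document(String id, String text) {}

class GroupedSplit {
  // Crude story key: lowercase, strip punctuation, collapse whitespace.
  // Lightly edited copies of the same story will often collide on this key;
  // heavier edits need fuzzier matching, but the principle is the same.
  static String storyKey(Document d) {
    return d.text().toLowerCase()
        .replaceAll("[^a-z0-9 ]", " ")
        .replaceAll("\\s+", " ")
        .trim();
  }

  // Split whole story groups 80/20 so no story straddles train and test.
  static List<List<Document>> split(List<Document> docs) {
    Map<String, List<Document>> stories = new LinkedHashMap<>();
    for (Document d : docs) {
      stories.computeIfAbsent(storyKey(d), k -> new ArrayList<>()).add(d);
    }
    List<Document> train = new ArrayList<>();
    List<Document> test = new ArrayList<>();
    int i = 0;
    for (Collection<Document> story : stories.values()) {
      if (i++ % 5 == 0) {
        test.addAll(story);
      } else {
        train.addAll(story);
      }
    }
    return List.of(train, test);
  }
}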

Another way to get this sort of problem is if you do your training/test split
randomly, but the new documents come from a later time.  If your classifier
is a good classifier, but is highly specific to documents from a particular
moment in time, then your test performance will be a realistic estimate of
performance for contemporaneous documents but will be much higher than
performance on documents from a later point in time.
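
The usual guard there is to make the holdout respect time: train on everything
before some cutoff and test on everything after it, so the test score reflects
documents from an era the models have never seen.  A minimal sketch (the
Instant timestamp is an assumption about what metadata you have per document):

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Stand-in document type; publishedAt is whatever timestamp you have per item.
record Document(String id, String text, Instant publishedAt) {}

class TimeSplit {
  // Train on everything before the cutoff, test on everything after it.
  static List<List<Document>> split(List<Document> docs, Instant cutoff) {
    List<Document> train = new ArrayList<>();
    List<Document> test = new ArrayList<>();
    for (Document d : docs) {
      if (d.publishedAt().isBefore(cutoff)) {
        train.add(d);
      } else {
        test.add(d);  // this score tracks the drift a random split hides
      }
    }
    return List.of(train, test);
  }
}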

A third possibility is that your training and test sets were somehow scrubbed
of poorly structured and invalid documents.  This often happens.
 Then, in the real system, if the scrubbing is not done, the classifier may
fail because the new documents are not scrubbed in the same way as the
training documents.
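
The cheap insurance there is to push every document, at training time and at
prediction time, through one shared scrub-and-validate step, and to decide
explicitly what to do with documents that fail it.  A sketch of the shape of
that (the particular cleaning rules here are placeholders, not a
recommendation):

// One shared cleaning path for training data and new production documents,
// so the models never see scrubbed text in training and raw text in production.
class Scrub {
  // Placeholder cleaning rules: strip markup-like tags, collapse whitespace.
  static String clean(String raw) {
    return raw.replaceAll("<[^>]*>", " ")
              .replaceAll("\\s+", " ")
              .trim();
  }

  // Placeholder validity rule: whatever filter built the training set has to
  // be applied (or consciously relaxed) on new documents too.
  static boolean isValid(String cleaned) {
    return cleaned.length() >= 50;
  }
}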

These are just a few of the ways that *I* have screwed up building
classifiers.  I am sure that there are more.

> We suspected that perhaps our models were underfitting or overfitting,
> hence this post.  However, I'll take any and all suggestions for anything
> else we should be looking at.
>

Well, I think that, almost by definition, you have an overfitting problem
of some kind.  The question is what kind.  The only thing that I think
you don't have is a frank target leak in your documents.  That would
(probably) have given you even higher scores on your test case.
