What Kate says is good advice.  You can get away with a considerable
amount of bias in the training data, but keep in mind that you are
implicitly telling the model something about the relative cost of errors,
and that can make things happen that you don't like.

As you noted, your model could have gotten 95% correct by simply saying
DON'T CARE to all documents.  This is a "stopped-clock" model which is only
accidentally and uninterestingly correct.

If you down-sample the DON'T CARE class you can get better results because
the model cannot cheat so easily.
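
For concreteness, here is a minimal sketch of what that down-sampling
could look like in a Mahout SGD training loop.  The Example holder, the
DON'T CARE index, the 10% keep probability, and the hyper-parameters are
all placeholders to illustrate the idea, not recommendations, and I am
using plain OnlineLogisticRegression rather than AdaptiveLogisticRegression
just to keep the sketch short.

import java.util.Random;

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

// Sketch only.  Example is a hypothetical holder for an already-encoded
// document (target class index plus Mahout feature vector).
public class DownSampleSketch {
  private static final int DONT_CARE = 0;              // assumed index of DON'T CARE
  private static final double KEEP_PROBABILITY = 0.1;  // illustrative; tune on held-out data

  public static OnlineLogisticRegression train(Iterable<Example> examples, int numFeatures) {
    // 11 categories: DON'T CARE plus the ~10 interesting classes
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(11, numFeatures, new L1())
            .learningRate(1)
            .lambda(1.0e-5);

    Random rand = new Random(42);
    for (Example ex : examples) {
      // Randomly drop most DON'T CARE examples so the model can't win
      // by always predicting the majority class.
      if (ex.target == DONT_CARE && rand.nextDouble() > KEEP_PROBABILITY) {
        continue;
      }
      learner.train(ex.target, ex.features);
    }
    return learner;
  }

  // Hypothetical holder, not part of Mahout.
  public static class Example {
    public final int target;
    public final Vector features;

    public Example(int target, Vector features) {
      this.target = target;
      this.features = features;
    }
  }
}

One caveat: down-sampling shifts the class prior the model learns, so the
probabilities it emits will lean toward the rare classes.  If you need
calibrated probabilities rather than just better recall on the interesting
classes, you have to correct for that afterwards.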

Another thing to try is to build a binary CARE/DON'T CARE model.  Then,
only on the documents that are CARE, build a model of topic.  This sort of
cascaded model can sometimes be much more accurate.  Down-sampling DON'T
CARE in the first step may still be a good thing.  In the second step it is
irrelevant, since DON'T CARE documents never reach it.
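
Again just a sketch, with made-up class names and an arbitrary 0.5 cutoff,
to show the shape of the cascade; the same idea works with
AdaptiveLogisticRegression or any other classifier.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

// Sketch of the cascaded approach: stage 1 decides CARE vs DON'T CARE,
// stage 2 assigns a topic only to documents that pass stage 1.
public class CascadeSketch {
  private static final double CARE_THRESHOLD = 0.5;   // illustrative cutoff

  private final OnlineLogisticRegression careModel;   // 2 categories
  private final OnlineLogisticRegression topicModel;  // ~10 topic categories

  public CascadeSketch(int numFeatures, int numTopics) {
    careModel = new OnlineLogisticRegression(2, numFeatures, new L1());
    topicModel = new OnlineLogisticRegression(numTopics, numFeatures, new L1());
  }

  public void train(int topic, boolean isCare, Vector features) {
    // Stage 1 sees every document (optionally down-sample DON'T CARE here too).
    careModel.train(isCare ? 1 : 0, features);
    // Stage 2 only ever sees CARE documents, so imbalance is not an issue there.
    if (isCare) {
      topicModel.train(topic, features);
    }
  }

  public int classify(Vector features) {
    // classifyScalar gives the score of category 1, i.e. p(CARE) with the
    // encoding used in train() above.
    double pCare = careModel.classifyScalar(features);
    if (pCare < CARE_THRESHOLD) {
      return -1;  // caller treats -1 as DON'T CARE
    }
    // maxValueIndex over the full score vector picks the most likely topic.
    return topicModel.classifyFull(features).maxValueIndex();
  }
}

The nice property is that the second model only ever trains and scores on
CARE documents, so the haystack never drowns out the needles there.  Its
error rate on the topics you actually care about is the number to watch.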

On Wed, Dec 5, 2012 at 12:44 AM, Kate Ericson <eric...@cs.colostate.edu> wrote:

> So just from an ML approach, you want to have roughly equal amounts of
> your target classifications in your training data - keep some "DON'T
> CARE"s, but try not to have significantly more of them than any of your 10
> target classes.
>
> I hope this helps!
>
> -Kate
>
>
> On Tue, Dec 4, 2012 at 4:26 PM, mahout-newbie <raman.sriniva...@gmail.com> wrote:
>
> > I am trying to classify a set of short text descriptions (1 - 15 words
> > long) into a handful of classes.  I am following the approach in the 20
> > newsgroups example using Adaptive Logistic Regression.
> >
> > There are a couple of twists to the problem I am solving:
> > 1) Only a small set of descriptions result in useful classifications - a
> > bit like finding a needle in a haystack. My training data has a grab-bag
> > classification called "DON'T CARE" into which 90-95% of the descriptions
> > end up. The remaining 5-10% are classified into roughly 10 classes.
> > 2) There are certain words (features) in the description that immediately
> > imply its classification unambiguously. However, they do not occur very
> > frequently in the data set.
> >
> > When I train and test with this data set, the overall classification
> > accuracy is very high (98%), except that a high proportion of the
> > incorrect classifications occur for the descriptions I am most interested
> > in. I guess one should expect this intuitively!
> >
> > What's the best way to model this scenario? Is it better to exclude the
> > "DON'T CARE" descriptions from the training set for SGD? Because when the
> > proportions accurately reflect the real data set, the classification
> > error rate for the interesting subset is too high!
> >
> > Appreciate any ideas...
>
