A two-class classifier is much easier to get right than a many-class
classifier.

The cascaded classifier is likely to avoid your problem.
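
Concretely, a cascade could look something like the sketch below (plain
Java against Mahout's SGD classes; CascadeClassifier, careThreshold and
the category numbering are made-up names for illustration, not an
existing API):

import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

// Two-stage cascade: stage one is a two-class care / don't-care model,
// stage two assigns one of the ~10 useful classes to whatever survives.
public class CascadeClassifier {
  private final OnlineLogisticRegression careModel;   // 2 categories
  private final OnlineLogisticRegression usefulModel; // ~10 categories
  private final double careThreshold;                 // tune on held-out data

  public CascadeClassifier(OnlineLogisticRegression careModel,
                           OnlineLogisticRegression usefulModel,
                           double careThreshold) {
    this.careModel = careModel;
    this.usefulModel = usefulModel;
    this.careThreshold = careThreshold;
  }

  // Returns -1 for DON'T CARE, otherwise the index of the useful class.
  public int classify(Vector features) {
    // classifyScalar gives p(category 1) for a two-category model;
    // category 1 is taken here to mean "interesting".
    if (careModel.classifyScalar(features) < careThreshold) {
      return -1;  // thrown out before the downstream model ever sees it
    }
    // classifyFull gives one score per useful class; take the max.
    return usefulModel.classifyFull(features).maxValueIndex();
  }
}

The downstream model can then be trained either on data with the
don't-care docs removed, or on the documents that actually make it past
the upstream model.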

Downsampling the don't-cares will also likely help.  When don't-cares
dominate the data set, the classifier can decrease its overall error rate
by failing safe, i.e., by calling nearly everything a don't-care.  That
isn't interesting, so you have to make that strategy costly to the
classifier.  Downsampling does this.
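
For example, something like this untested sketch, where the Doc class,
the keep fraction, and the seed are just placeholders (keeping ~10% of
the don't-cares turns a 95/5 split into roughly 2:1):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Keep every interesting example but only a fraction of the DON'T CARE
// examples, so "fail safe" stops being a cheap strategy for the learner.
public class Downsample {
  public static class Doc {
    public final String label;
    public final String text;
    public Doc(String label, String text) {
      this.label = label;
      this.text = text;
    }
  }

  public static List<Doc> downsampleDontCares(List<Doc> docs,
                                              double keepFraction,
                                              long seed) {
    Random rand = new Random(seed);
    List<Doc> result = new ArrayList<>();
    for (Doc doc : docs) {
      // Always keep the interesting classes; keep don't-cares with
      // probability keepFraction (~0.1 turns a 95/5 split into ~2:1).
      if (!"DON'T CARE".equals(doc.label)
          || rand.nextDouble() < keepFraction) {
        result.add(doc);
      }
    }
    return result;
  }
}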

Also, what level of regularization are you using?
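
For reference, here is roughly where regularization enters if you drop
down to the plain (non-adaptive) OnlineLogisticRegression learner; the
feature count and the lambda value below are placeholders to sweep, not
recommendations:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

// Builds an L1-regularized SGD learner for 10 useful classes plus
// DON'T CARE.  Lambda is usually worth sweeping over a few orders of
// magnitude on held-out data.
public class BuildLearner {
  public static OnlineLogisticRegression build(int numFeatures) {
    int numCategories = 11;  // 10 useful classes + DON'T CARE
    return new OnlineLogisticRegression(numCategories, numFeatures, new L1())
        .lambda(1.0e-4)      // prior strength, i.e. regularization level
        .learningRate(50);   // also worth tuning
  }
}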

On Wed, Dec 5, 2012 at 10:16 PM, Raman Srinivasan <raman.sriniva...@gmail.com> wrote:

> Thanks for the responses. The cascading approach sounds quite interesting.
> My problem though is that many of the useful items ended up in the
> don't-care bucket, not that they were misclassified among the useful
> categories. So, even if I were to use a cascading approach I am afraid that
> many useful cases may be thrown out at the first level even before I can
> sub-classify them. What's usually a good approach when less than 5% of
> the data is meaningful?
>
>
> On Wed, Dec 5, 2012 at 10:26 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Try the cascaded model.  Train the downstream model on data without the
> > don't-care docs or train it on documents that actually get through the
> > upstream model.
> >
> > On Wed, Dec 5, 2012 at 4:50 PM, Raman Srinivasan <raman.sriniva...@gmail.com> wrote:
> >
> > > I can exclude the "don't care" cases from the training set. However,
> > > the real data that I need to classify will contain mostly these
> > > useless descriptions which I would like the model to throw out (i.e.,
> > > classify as "DON'T CARE"). If I only train with examples that are
> > > useful then how would the model learn to discard the rest? If the
> > > model is only aware of useful classes, it will end up classifying the
> > > useless items into one of these useful classes using the max. score,
> > > correct?
> > >
> > >
> > > On Wed, Dec 5, 2012 at 9:16 AM, Mohit Singh <mohit1...@gmail.com> wrote:
> > >
> > > > May I ask why you are giving the don't-care examples to the
> > > > algorithm? Can't you weed them out?
> > > > Is adaptive LR the same as weighted LR, which is used when you have
> > > > unbalanced training examples?
> > > >
> > > > On Wednesday, December 5, 2012, Raman Srinivasan <raman.sriniva...@gmail.com> wrote:
> > > > > I am trying to classify a set of short text descriptions
> > > > > (1-15 words long) into a handful of classes.  I am following
> > > > > the approach in the 20 newsgroups example using Adaptive
> > > > > Logistic Regression.
> > > > >
> > > > > There are a couple of twists to the problem I am solving:
> > > > > 1) Only a small set of descriptions result in useful
> > > > > classifications - a bit like finding a needle in a haystack. My
> > > > > training data has a grab-bag classification called "DON'T CARE"
> > > > > into which 90-95% of the descriptions end up. The remaining
> > > > > 5-10% are classified into roughly 10 classes.
> > > > > 2) There are certain words (features) in the description that
> > > > > immediately imply its classification unambiguously. However,
> > > > > they do not occur very frequently in the data set.
> > > > >
> > > > > When I train and test with this data set, the overall
> > > > > classification accuracy is very high (98%), except that a high
> > > > > proportion of the incorrect classifications occur for the
> > > > > descriptions I am most interested in. I guess one should expect
> > > > > this intuitively!
> > > > >
> > > > > What's the best way to model this scenario? Is it better to
> > > > > exclude the "DON'T CARE" descriptions from the training set for
> > > > > SGD? Because when the proportions accurately reflect the real
> > > > > data set, the classification error rate for the interesting
> > > > > subset is too high!
> > > > >
> > > > > Appreciate any ideas...
> > > > >
> > > >
> > > > --
> > > > Mohit
> > > >
> > > > "When you want success as badly as you want the air, then you will
> get
> > > it.
> > > > There is no other secret of success."
> > > > -Socrates
> > > >
> > >
> >
>
