Re: Document Classification with imbalanced data

[email protected] Wed, 03 Jul 2019 08:49:24 -0700

 Thanks, I am unfamiliar with the approaches that you mentioned - will 
investigate.  I forgot to mention that this is a multi-class classification 
problem.  Each sample represents a page from a corpus of documents that have 
been scanned, with the text extracted using OCR (so the text is noisy).

Label  | Samples | %
-------+---------+-------
C1     | 131613  | 97.71
C2     |    873  |  0.65
C3     |    830  |  0.62
C4     |    492  |  0.37
C5     |    456  |  0.34
C6     |    430  |  0.32

- viraf

    On Wednesday, July 3, 2019, 10:31:44 AM EDT, Dan Russ <[email protected]> 
wrote:  

 Have you considered using outlier detection methods?  I’m not really an expert 
on this, but maybe you can define your majority class very well, and the other 
class is the outlier.  Another option may be one-class classification 
(https://en.wikipedia.org/wiki/One-class_classification), SVDD is an example of 
this. Finally, you might want to look at data augmentation techniques.  I am in 
the middle of some work using conditional GANs, but it is not working out so 
great for me at the moment.

Let me know if any of these work out for you.
Daniel

> On Jul 3, 2019, at 10:22 AM, [email protected] wrote:
> 
> I am trying document classification using OpenNLP; however, my data is highly 
> unbalanced (majority class is 97%).  I recognize that I could randomly 
> over/under sample the data set, and am reading up on SMOTE and ADASYN (not 
> sure how to apply these to OpenNLP).
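
For reference, the random oversampling mentioned above can be done on the training file before it is handed to the trainer, so it needs no support from OpenNLP itself. The following is only a rough sketch in Python, not an OpenNLP API: the function names are made up for illustration, and the output format (one document per line, category first, then whitespace, then the text) is an assumption that should be checked against the document categorizer documentation. It duplicates minority-class samples at random until every class is as large as the biggest one.

```python
import random
from collections import defaultdict


def random_oversample(samples, seed=0):
    """Balance classes by duplicating minority-class samples at random.

    samples: list of (label, text) pairs. Hypothetical helper, not part
    of OpenNLP. Returns a new shuffled list in which every label occurs
    as often as the most frequent label.
    """
    rng = random.Random(seed)

    # Group the texts by their label.
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(texts) for texts in by_label.values())

    balanced = []
    for label, texts in by_label.items():
        # Keep all original samples of this class.
        balanced.extend((label, text) for text in texts)
        # Add randomly chosen duplicates (with replacement) to fill the gap.
        for _ in range(target - len(texts)):
            balanced.append((label, rng.choice(texts)))

    rng.shuffle(balanced)
    return balanced


def to_training_lines(samples):
    """Format samples as 'label text' lines (assumed doccat-style format)."""
    return [f"{label} {' '.join(text.split())}" for label, text in samples]


if __name__ == "__main__":
    # Tiny made-up example: 6 pages of C1, 2 of C2, 1 of C3.
    data = [("C1", f"majority page {i}") for i in range(6)]
    data += [("C2", "minority page a"), ("C2", "minority page b")]
    data += [("C3", "rare page")]
    for line in to_training_lines(random_oversample(data)):
        print(line)
```

Only the training split should be resampled; the evaluation split should keep the original class distribution, otherwise the reported accuracy will not reflect the real data. With a 97.71% majority class, pure oversampling multiplies the minority pages several hundred times, so combining it with some undersampling of C1 may be more practical.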
