Thanks, I am unfamiliar with the approaches that you mentioned and will
investigate. I forgot to mention that this is a multi-class classification
problem. Each sample represents a page from a corpus of documents that have
been scanned and had their text extracted using OCR (so the text is noisy).

Label | Samples |     %
------+---------+------
C1    |  131613 | 97.71
C2    |     873 |  0.65
C3    |     830 |  0.62
C4    |     492 |  0.37
C5    |     456 |  0.34
C6    |     430 |  0.32

- viraf
On Wednesday, July 3, 2019, 10:31:44 AM EDT, Dan Russ <[email protected]>
wrote:
Have you considered using outlier detection methods? I'm not really an expert
on this, but maybe you can define your majority class very well and treat the
other classes as outliers. Another option may be one-class classification
(https://en.wikipedia.org/wiki/One-class_classification); SVDD (support vector
data description) is an example of this. Finally, you might want to look at
data augmentation techniques. I am in the middle of some work using
conditional GANs, but it is not working out very well for me at the moment.
Let me know if any of these work out for you.
Daniel
> On Jul 3, 2019, at 10:22 AM, [email protected] wrote:
>
> I am trying document classification using OpenNLP; however, my data is
> highly unbalanced (the majority class is 97%). I recognize that I could
> randomly over/under-sample the data set, and I am reading up on SMOTE and
> ADASYN (not sure how to apply these to OpenNLP).
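
For reference, the random over/under-sampling mentioned above can be sketched in a few lines. This is only a minimal illustration in plain Python, not anything specific to OpenNLP: the function name `resample`, the `target_per_class` parameter, and the toy page texts are all made up for the example. It assumes the samples are available as (label, text) pairs; the balanced list could then be written out in whatever training format the classifier expects.

```python
import random
from collections import Counter, defaultdict


def resample(samples, target_per_class, seed=0):
    """Randomly under-sample large classes and over-sample small ones.

    samples: list of (label, text) pairs.
    target_per_class: samples to keep per label (hypothetical knob).
    """
    rng = random.Random(seed)

    # Group the samples by their label.
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append((label, text))

    balanced = []
    for label, items in by_label.items():
        if len(items) >= target_per_class:
            # Under-sample: draw without replacement.
            balanced.extend(rng.sample(items, target_per_class))
        else:
            # Over-sample: keep every original, then add random duplicates
            # drawn with replacement until the target is reached.
            extra = rng.choices(items, k=target_per_class - len(items))
            balanced.extend(items + extra)

    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 97 majority pages and 3 minority pages.
toy = [("C1", f"majority page {i}") for i in range(97)]
toy += [("C2", f"minority page {i}") for i in range(3)]

balanced = resample(toy, target_per_class=10)
counts = Counter(label for label, _ in balanced)
print(counts)
```

If something like this is used, it should probably be applied to the training split only, so that the evaluation data keeps the real class distribution. Note also that plain duplication adds no new information about the minority classes, which is the limitation SMOTE and ADASYN try to address by synthesizing new points in feature space.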