Really surprised. Looking at the documentation (https://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training), your training data should be one document per line, in the following format:

Is_cat_1 <text>
Is_not_cat_1 <text>
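For reference, here is a minimal training sketch against a file in that format (the file name train.txt and the sample tokens are placeholders, not from your setup):

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.doccat.DoccatFactory;
    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.doccat.DocumentSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainDoccat {
        public static void main(String[] args) throws Exception {
            // train.txt: one document per line, "<category> <text>"
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("train.txt")),
                    StandardCharsets.UTF_8);
            ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

            // Default factory = bag-of-words features, maxent trainer
            DoccatModel model = DocumentCategorizerME.train("en", samples,
                    TrainingParameters.defaultParams(), new DoccatFactory());

            DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
            double[] probs = categorizer.categorize(
                    new String[] {"0099", "usc", "18", "usc", "2"});
            System.out.println(categorizer.getBestCategory(probs));
        }
    }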
Is that how you formatted your data?

Daniel

> On Oct 17, 2018, at 3:50 PM, Benedict Holland <[email protected]> wrote:
>
> Hi! Thanks for the reply.
>
> Yes. There is a massive imbalance. Out of the thousands of observations
> I have, only a small handful are actually positive observations in
> is_cat_1. The rest are in is_not_cat_1. In some cases, the number of
> positives is 1.
>
> For example, in one category, the only observation in is_cat_1 is:
>
> assault use reckless force or vi
>
> I have a bunch of observations in is_not_cat_1. The model gave this
> text
>
> 0099 usc 18 usc 2
>
> a probability match of over 90%. Mind you, I expected this setup to be
> horrible. I actually expected this sort of text to get close to 100% in
> is_not_cat_1, but what I really cannot explain is the overlap. I did
> verify that the BoW produces the following features:
>
> ["bow=assault", "bow=use", "bow=reckless", "bow=force", "bow=or", "bow=vi"]
>
> The only thing I could come up with is something like each string being
> broken apart into individual letters, but that wouldn't make sense. Or
> would it?
>
> Thanks,
> ~Ben
>
> On Wed, Oct 17, 2018 at 3:26 PM Dan Russ <[email protected]> wrote:
>
>> Hi Ben,
>> Are you sure that your training documents are formatted appropriately?
>> Also, do you have a large imbalance in the number of training
>> documents? If the text in the testing document is not in either CAT_1
>> or the OTHER_CAT, there will be a .5 assignment to each category
>> (assuming equal document counts, so the prior doesn’t change the
>> value). A .5 assignment is like “I can’t tell the two categories
>> apart”. You probably don’t want to think of it as “You don’t look like
>> CAT_1, so you are NOT_CAT_1”.
>> Daniel
>>
>>> On Oct 17, 2018, at 1:14 PM, Benedict Holland
>>> <[email protected]> wrote:
>>>
>>> Hello all,
>>>
>>> I can't quite figure out how the Doccat MaxEnt modeling works. Here
>>> is my setup:
>>>
>>> I have a set of training texts split into is_cat_1 and is_not_cat_1.
>>> I train my model using the default bag-of-words model. I have a
>>> document without any overlapping text with the texts in is_cat_1. It
>>> might overlap with text in is_not_cat_1. Meaning, every single word
>>> in the document I want to categorize does not appear in any of the
>>> model training data in the is_cat_1 category.
>>>
>>> The result of the MaxEnt model for my document is a probability of
>>> over 90% that it fits into is_cat_1. Why is that?
>>>
>>> Thanks,
>>> ~Ben
>>
>>
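To make the .5 point above concrete: a maxent model scores each category by summing the weights of the features that fire for the document, then normalizing. If none of the document's words were ever seen in training, no "bow=..." feature fires, both scores stay equal, and the normalized probabilities are uniform. A toy sketch of that normalization (not OpenNLP internals, and ignoring the prior):

    // Two categories, no active features for either: both scores stay 0.
    double scoreCat1 = 0.0;     // sum of weights firing for is_cat_1
    double scoreNotCat1 = 0.0;  // sum of weights firing for is_not_cat_1
    double z = Math.exp(scoreCat1) + Math.exp(scoreNotCat1);
    System.out.println(Math.exp(scoreCat1) / z);    // prints 0.5
    System.out.println(Math.exp(scoreNotCat1) / z); // prints 0.5

So a 90%+ score for is_cat_1 on text with no word overlap suggests the model saw different features at training time than you expect, which is why the format is worth checking.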

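And if you want to double-check exactly which features the default pipeline extracts, you can, if I remember the API right, call the bag-of-words generator directly on your tokens (using Ben's example tokens here):

    import java.util.Collection;
    import java.util.Collections;

    import opennlp.tools.doccat.BagOfWordsFeatureGenerator;

    public class CheckFeatures {
        public static void main(String[] args) {
            // Expected: [bow=assault, bow=use, bow=reckless, bow=force, bow=or, bow=vi]
            Collection<String> features = new BagOfWordsFeatureGenerator().extractFeatures(
                    new String[] {"assault", "use", "reckless", "force", "or", "vi"},
                    Collections.emptyMap());
            System.out.println(features);
        }
    }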