Really surprised. Looking at the documentation, your training data should
be in the following format (see
https://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training):

Is_cat_1 <text>
Is_not_cat_1 <text>
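
For reference, a minimal sketch of training from a file in that format
(assuming OpenNLP 1.9.x; the file name "train.txt" is just a placeholder):

import java.io.File;
import java.nio.charset.StandardCharsets;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainDoccat {
    public static void main(String[] args) throws Exception {
        // One document per line: "<category><whitespace><text>"
        InputStreamFactory in =
            new MarkableFileInputStreamFactory(new File("train.txt"));
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(
            new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        // defaultParams() uses the maxent trainer with a feature cutoff
        // of 5; with only a handful of positive documents that cutoff can
        // prune all of their features, so it is lowered to 0 here
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.CUTOFF_PARAM, "0");

        DoccatModel model = DocumentCategorizerME.train(
            "en", samples, params, new DoccatFactory());
    }
}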

Is that how you formatted your data?
Daniel

> On Oct 17, 2018, at 3:50 PM, Benedict Holland <[email protected]> 
> wrote:
> 
> Hi! Thanks for the reply.
> 
> Yes, there is a massive imbalance. Out of the thousands of observations I
> have, only a small handful are actually positive observations in
> is_cat_1. The rest are in is_not_cat_1. In some cases, the number of
> positives is 1.
> 
> For example:
> 
> In one category, the only observation in is_cat_1 is:
> assault use reckless force or vi
> 
> I have a bunch of observations in is_not_cat_1. This model scored this
> text
> 
> 0099 usc 18 usc 2
> 
> as over 90% likely to be in is_cat_1. Mind you, I expected this setup to
> be horrible. I actually expected this sort of text to score close to 100%
> in is_not_cat_1, but what I really cannot explain is the overlap. I did
> verify that the BoW produces the following features:
> ["bow=assault", "bow=use", "bow=reckless", "bow=force", "bow=or", "bow=vi"]
> 
> The only thing I could come up with is something like each string being
> broken apart into individual letters, but that wouldn't make sense. Or
> would it?
> 
> Thanks,
> ~Ben
> 
> On Wed, Oct 17, 2018 at 3:26 PM Dan Russ <[email protected]> wrote:
> 
>> Hi Ben,
>>   Are you sure that your training documents are formatted appropriately?
>> Also, do you have a large imbalance in the # of training documents?  If the
>> text in the testing document is not in either CAT_1 or the OTHER_CAT, there
>> will be a .5 assignment to each category (assuming an equal number of
>> documents per category, so the prior doesn’t change the value).  A .5
>> assignment is like saying “I can’t tell the two categories apart”.  You
>> probably don’t want to think of it as “you don’t look like CAT_1, so you
>> are NOT_CAT_1”.
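>> 
>> A quick way to see this (a sketch; assumes “model” is a DoccatModel you
>> already trained):
>> 
>> import opennlp.tools.doccat.DocumentCategorizerME;
>> 
>> DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
>> // tokens that appear in neither category's training data
>> String[] tokens = {"completely", "unseen", "words"};
>> double[] probs = categorizer.categorize(tokens);
>> for (int i = 0; i < probs.length; i++) {
>>     // with no known features and equal priors, expect roughly .5 each
>>     System.out.println(categorizer.getCategory(i) + " = " + probs[i]);
>> }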
>> Daniel
>> 
>>> On Oct 17, 2018, at 1:14 PM, Benedict Holland <[email protected]> wrote:
>>> 
>>> Hello all,
>>> 
>>> I can't quite figure out how the Doccat MaxEnt modeling works. Here is my
>>> setup:
>>> 
>>> I have a set of training texts split into is_cat_1 and is_not_cat_1. I
>>> train my model using the default bag-of-words model. I have a document
>>> whose text has no overlap with the texts in is_cat_1, though it might
>>> overlap with texts in is_not_cat_1. Meaning, every single word in the
>>> document I want to categorize does not appear in any of the is_cat_1
>>> training data.
>>> 
>>> The result of the MaxEnt model for my document is a probability of over
>>> 90% that it belongs to is_cat_1. Why is that?
>>> 
>>> Thanks,
>>> ~Ben
>> 
>> 
