Hi Ben,
   Are you sure that your training documents are formatted appropriately?
Also, do you have a large imbalance in the number of training documents per
category?  If none of the text in the test document appears in the training
data for either CAT_1 or OTHER_CAT, each category will be assigned 0.5
(assuming equal numbers of documents, so the prior doesn’t change the value).
A 0.5 assignment means “I can’t tell the two categories apart”.  You probably
don’t want to read it as “You don’t look like CAT_1, so you are NOT_CAT_1”.
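
To make the 0.5 case concrete, here is a minimal sketch of a generic
two-class log-linear (softmax) scorer. It is not OpenNLP's actual
implementation; the names classify, weights and log_prior, and all the
numbers, are made up for illustration. It assumes that a document with no
known words contributes no feature weights, so only the prior is left.

```python
import math

def softmax(scores):
    """Turn per-category log scores into probabilities."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def classify(tokens, weights, log_prior):
    """Sum the weights of known tokens on top of the log prior."""
    scores = dict(log_prior)
    for tok in tokens:
        # Tokens never seen in training have no weights and add nothing.
        for cat, w in weights.get(tok, {}).items():
            scores[cat] += w
    return softmax(scores)

# Hypothetical learned weights: "goal" is evidence for CAT_1.
weights = {"goal": {"CAT_1": 1.2, "OTHER_CAT": -1.2}}

# Balanced training data vs. a 9:1 imbalance (illustrative values only).
equal_prior = {"CAT_1": math.log(0.5), "OTHER_CAT": math.log(0.5)}
skewed_prior = {"CAT_1": math.log(0.9), "OTHER_CAT": math.log(0.1)}

unseen = ["zebra", "quasar"]  # no overlap with the training vocabulary
print(classify(unseen, weights, equal_prior))
print(classify(unseen, weights, skewed_prior))
```

With the balanced prior the unseen document comes out at 0.5 for each
category. With the skewed prior the same document leans heavily toward the
larger category, even though none of its words were seen in training. That
is one way a score above 90% could appear, if the training set really is
that imbalanced.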
Daniel

> On Oct 17, 2018, at 1:14 PM, Benedict Holland <[email protected]> 
> wrote:
> 
> Hello all,
> 
> I can't quite figure out how the Doccat MaxEnt modeling works. Here is my
> setup:
> 
> I have a set of training texts split into is_cat_1 and is_not_cat_1. I
> train my model using the default bag of words model. I have a document
> without any overlapping text with texts that are in is_cat_1. They might
> overlap with text in is_not_cat_1. Meaning, every single word in the
> document I want to categorize does not appear in any of the model training
> data in the is_cat_1 category.
> 
> The result of the MaxEnt model for my document is a probability over 90%
> that it fits into the is_cat_1. Why is that?
> 
> Thanks,
> ~Ben
