Hi Ben,

Are you sure that your training documents are formatted appropriately? Also, do you have a large imbalance in the number of training documents per category?

If none of the text in the test document appears in the training data for either CAT_1 or OTHER_CAT, each category gets a .5 assignment (assuming an equal number of documents in each, so the prior doesn’t change the value). A .5 assignment is like “I can’t tell the two categories apart”. You probably don’t want to think of it as “You don’t look like CAT_1, so you are NOT_CAT_1”.

Daniel
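
P.S. In case it helps, here is a minimal sketch of the .5 case described above. It is a generic two-class maxent (softmax) calculation in Python, not OpenNLP's actual implementation; the function name `maxent_probs`, the example words, and the weight values are all made up for illustration, and it assumes a uniform prior.

```python
import math

def maxent_probs(weights, categories, tokens):
    """Generic maxent/softmax scoring (illustrative, not OpenNLP code).

    weights maps (token, category) -> learned weight. A token that was
    never seen in training has no entry and contributes 0 to every category.
    """
    scores = {
        c: sum(weights.get((t, c), 0.0) for t in tokens)
        for c in categories
    }
    z = sum(math.exp(s) for s in scores.values())  # normalizing constant
    return {c: math.exp(s) / z for c, s in scores.items()}

categories = ["CAT_1", "OTHER_CAT"]

# Hypothetical weights for a single word seen in training.
weights = {
    ("invoice", "CAT_1"): 1.2,
    ("invoice", "OTHER_CAT"): -1.2,
}

# No token is known to the model: every score is 0, so the split is even.
unseen = maxent_probs(weights, categories, ["zebra", "quark"])

# One known token pulls the distribution toward CAT_1.
seen = maxent_probs(weights, categories, ["invoice", "zebra"])

print(unseen)
print(seen)
```

With no known words, every category scores 0 and the output is uniform, which is the “I can’t tell” case. A result far from .5 means some feature in the document does carry weight in the model.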
> On Oct 17, 2018, at 1:14 PM, Benedict Holland <[email protected]> wrote:
>
> Hello all,
>
> I can't quite figure out how the Doccat MaxEnt modeling works. Here is my
> setup:
>
> I have a set of training texts split into is_cat_1 and is_not_cat_1. I
> train my model using the default bag of words model. I have a document
> without any overlapping text with texts that are in is_cat_1. They might
> overlap with text in is_not_cat_1. Meaning, every single word in the
> document I want to categorize does not appear in any of the model training
> data in the is_cat_1 category.
>
> The result of the MaxEnt model for my document is a probability over 90%
> that it fits into the is_cat_1. Why is that?
>
> Thanks,
> ~Ben
