Hi Ben,

   if a document can be in multiple categories, you should see that 
reflected in the probabilities.  The top categories will be close in score.  It 
will not be 1/m, because that would imply that ALL categories are “equally 
probable” or that you have no idea.  However, if you have 3 classes and two are 
likely, it may be 0.49, 0.49, 0.02.  Remember that the results are normalized 
by a softmax at the end, so the sum of all probabilities will always be 1.
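   To make the normalization concrete, here is a toy softmax calculation (plain 
Java, not OpenNLP code) with made-up raw scores of 3.0, 3.0, 0.0 for three 
classes; two strong categories end up splitting the mass instead of both 
scoring near 1:

public class SoftmaxDemo {
    public static void main(String[] args) {
        // hypothetical raw model scores for 3 categories
        double[] scores = {3.0, 3.0, 0.0};
        double sum = 0.0;
        for (double s : scores) {
            sum += Math.exp(s);
        }
        // softmax: exp(score_i) / sum_j exp(score_j), guaranteed to sum to 1
        for (double s : scores) {
            System.out.printf("%.3f%n", Math.exp(s) / sum);
        }
        // prints roughly 0.488, 0.488, 0.024 -- close to the 0.49/0.49/0.02 example
    }
}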
   Sorry, but multi-class classification is more complicated than binary 
classification.  If you really are interested in multi-label classification, 
I’m not sure maxent (at least the way openNLP formulated the solution) is 
appropriate for your needs.  You might want to consider individual binary 
classifiers, one model per label (a rough training sketch follows the example 
files below):

train_cat1.txt...
cat_1_TRUE <text>   
cat_1_FALSE <text>
…

train_cat2.txt…
cat_2_FALSE <text>
cat_2_TRUE <text>

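Here is a rough sketch of that per-label training loop against the OpenNLP 
doccat API (I’m assuming 1.8.x class names, the "en" language code, and the 
train_cat*.txt files above; double-check the exact signatures for your 
version):

import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.doccat.*;
import opennlp.tools.util.*;

public class PerLabelTrainer {
    public static void main(String[] args) throws Exception {
        // one binary training file (cat_N_TRUE / cat_N_FALSE samples) per label
        String[] trainingFiles = {"train_cat1.txt", "train_cat2.txt"};
        for (String fileName : trainingFiles) {
            InputStreamFactory in = new MarkableFileInputStreamFactory(new File(fileName));
            ObjectStream<DocumentSample> samples =
                new DocumentSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));
            // train an independent maxent model for this one label
            DoccatModel model = DocumentCategorizerME.train("en", samples,
                TrainingParameters.defaultParams(), new DoccatFactory());
            try (FileOutputStream out = new FileOutputStream(fileName + ".bin")) {
                model.serialize(out);
            }
        }
    }
}

At classification time you run every per-label model over the document and 
keep each label whose TRUE outcome wins (for a binary model that is the same 
as its probability being above 0.5), something like:

DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
double[] outcomes = categorizer.categorize(docTokens);  // docTokens: the document split into tokens
boolean inCat1 = "cat_1_TRUE".equals(categorizer.getBestCategory(outcomes));

That way a document can come back TRUE for several labels at once, which is 
what you can’t get out of a single normalized multi-class model.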
Hope it helps. Let me know what you wind up doing...
Daniel
  
> On Apr 12, 2018, at 4:22 PM, Benedict Holland <[email protected]> 
> wrote:
> 
> Hello all,
> 
> I understand that maximum entropy models are excellent at categorizing
> documents. As it turns out, I have a situation where 1 document can be in
> many categories (1:m relationship). I believe that I could create training
> data that looks something like:
> 
> category_1 <text>
> category_2 <text>
> ...
> 
> If I do this, will the resulting probability model return category
> probabilities as Pr(<text> in category_m) = 1/m for all categories m, or will
> it return Pr(<text> in category_m) = 1 for all categories m?
> 
> This is a very important distinction. I really hope it is the latter. If it
> isn't, do you have a way to make sure that if I receive a text that is
> similar to the training data, I can get a probability close to 1 if it fits
> into multiple categories?
> 
> Thanks,
> ~Ben
