On 13-10-01 11:10 PM, George Ramonov wrote:
Hi everyone,
I am new to the OpenNLP maxent classifier, and I have a question regarding
using features that are label-dependent.
I have two sets of words (S1 and S2, where ||S1|| >> ||S2||), and I am
trying to find words from S2 that are most similar to S1 using
features I designed. I turned this into a classification problem, treating
words from S2 as labels, and built a nice training set. However, my
features are dependent on the labels themselves. I can't find a simple way in
OpenNLP to utilize labels in the prediction process. My guess is I would
have to subclass MaxentModel and implement the eval() method? Is there an
easier way to solve this problem? Or perhaps, maximum entropy is not the
best algorithm of choice?
You cannot use the label in your features because it is unknown at
prediction time. You can however use the set of all possible labels to
compute features. For example, if one of your features is the
edit-distance, you can compute the edit-distance of a word to each
possible label. Another option is to add a feature to specify the label
with the minimal edit-distance. If your possible labels are "w1" and
"w2", a feature vector could look like:
"edit-distance to w1", "edit-distance to w2", "1 if w1 has smallest edit
distance, 0 otherwise", "1 if w2 has smallest edit distance, 0 otherwise"
From there, you can easily generalize to many features and many labels.
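As a rough illustration of the feature vector described above, here is a
minimal sketch in plain Python. It does not use the OpenNLP API; the function
names (edit_distance, label_aware_features) and the feature-name strings are
hypothetical, chosen only for this example. It assumes Levenshtein distance
as the edit-distance and flags every label that ties for the smallest
distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def label_aware_features(word, labels):
    """Build features from the full label set, never from the true label.

    For each candidate label this emits:
      - the edit-distance from `word` to that label
      - 1 if that label has the smallest edit-distance, 0 otherwise
    The feature names are hypothetical, for illustration only.
    """
    distances = {label: edit_distance(word, label) for label in labels}
    smallest = min(distances.values())
    features = {}
    for label, dist in distances.items():
        features["edit_distance_to=" + label] = dist
        features["is_closest=" + label] = 1 if dist == smallest else 0
    return features


if __name__ == "__main__":
    # Hypothetical label set standing in for S2.
    print(label_aware_features("cat", ["cart", "dog"]))
```

The same function is called at training time and at prediction time, since it
only needs the word and the fixed list of possible labels. With many labels
the vector grows by two entries per label, so you may prefer to keep only the
"is closest" indicators.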
Hope this helps,
Alexandr