Thanks for your reply, Jim; I see your point about edit distance. In my first attempt to find similar words I used Levenshtein distance (along with some other similarity measures, like the Tversky index). It didn't work that well for me, because my similarity constraints are much richer than the number of character insertions/removals. For example, s1 is similar to s2 if s1 is an acronym of s2, or if s1 is a synonym of s2. I thought that by turning this into a classification problem I could predict the most prominent features based on my training set.
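To illustrate, the acronym constraint above could be sketched as a boolean feature like the one below. `isAcronym` is a hypothetical helper for this discussion, not part of OpenNLP, and it only handles the simple first-letter-of-each-word case:

```java
// Hypothetical similarity feature: true when s1 is an acronym of s2,
// e.g. "NLP" vs "Natural Language Processing". Not part of OpenNLP;
// only covers the naive first-letter-of-each-word case.
public final class SimilarityFeatures {

    public static boolean isAcronym(String s1, String s2) {
        StringBuilder initials = new StringBuilder();
        for (String word : s2.trim().split("\\s+")) {
            initials.append(Character.toUpperCase(word.charAt(0)));
        }
        return initials.toString().equals(s1.toUpperCase());
    }
}
```

A synonym feature would work the same way structurally, just backed by a thesaurus lookup instead of string manipulation.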
Thanks,

On Wed, Oct 2, 2013 at 4:03 AM, Jim - FooBar(); <[email protected]> wrote:

> The most straightforward approach to doing what you want would be to
> define a string-similarity measure (e.g. Levenshtein distance) and then,
> for each word in S2, simply iterate over S1 and disregard all occurrences
> of words that return more than some predefined distance value. You are
> actually overcomplicating the problem by using maxent...
>
> hope that helps,
> Jim
>
> On 02/10/13 04:10, George Ramonov wrote:
>
>> Hi everyone,
>>
>> I am new to the OpenNLP maxent classifier, and I have a question
>> regarding using features that are label-dependent.
>>
>> I have two sets of words (S1 and S2, where ||S1|| >> ||S2||), and I am
>> trying to find words from S2 that are most similar to S1 using features
>> I designed. I turned this into a classification problem, treating words
>> from S2 as labels, and built a nice training set. However, my features
>> are dependent on the labels themselves. I can't find a simple way in
>> OpenNLP to utilize labels in the prediction process. My guess is I would
>> have to subclass MaxentModel and implement the eval() method? Is there
>> an easier way to solve this problem? Or perhaps maximum entropy is not
>> the best algorithm of choice?
>>
>> Thanks,
>> George
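For reference, the baseline Jim describes could be sketched roughly as below: a plain dynamic-programming Levenshtein distance plus a threshold filter. The class and method names, and the `maxDistance` parameter, are assumptions for illustration, not from either message:

```java
import java.util.ArrayList;
import java.util.List;

public final class DistanceFilter {

    // Classic dynamic-programming Levenshtein distance
    // (unit-cost insert/delete/substitute), using two rolling rows.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Keep each word of s2 only if some word in s1 is within maxDistance,
    // disregarding the rest -- the filtering loop Jim suggests.
    public static List<String> similarWords(List<String> s2, List<String> s1,
                                            int maxDistance) {
        List<String> kept = new ArrayList<>();
        for (String candidate : s2) {
            for (String target : s1) {
                if (levenshtein(candidate, target) <= maxDistance) {
                    kept.add(candidate);
                    break;
                }
            }
        }
        return kept;
    }
}
```

As noted in the reply above, this baseline only captures character-level edits, which is exactly the limitation that motivated the richer acronym/synonym constraints.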
