Hi,

It turned out to be easy to do. I created a class, TokenTag, that holds a token and its POS tag. Then I changed the Featurizer to work with BeamSearch<TokenTag> and SequenceValidator<TokenTag>. With this change, the sequence validator can access both the token and its POS tag.
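
For illustration only, here is a minimal sketch of what such a POS-aware validator could look like. It is not the actual featurizer code. The nested SequenceValidator interface is a local stand-in that copies the method shape of opennlp.tools.util.SequenceValidator<T>, so the example compiles without OpenNLP on the classpath. The TokenTag fields, the PosAwareValidator class, the POS tags ("n", "v") and the feature labels ("gender=M", "tense=PRES") are all assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class FeaturizerValidatorSketch {

    /** Local stand-in with the same method shape as OpenNLP's SequenceValidator<T>. */
    interface SequenceValidator<T> {
        boolean validSequence(int i, T[] inputSequence, String[] outcomesSequence, String outcome);
    }

    /** A token together with its POS tag (hypothetical layout, not the real class). */
    static final class TokenTag {
        private final String token;
        private final String posTag;

        TokenTag(String token, String posTag) {
            this.token = token;
            this.posTag = posTag;
        }

        String getToken() { return token; }
        String getPosTag() { return posTag; }
    }

    /** Accepts an outcome only if it is listed as allowed for the POS tag at position i. */
    static final class PosAwareValidator implements SequenceValidator<TokenTag> {
        private final Map<String, Set<String>> allowedByPos;

        PosAwareValidator(Map<String, Set<String>> allowedByPos) {
            this.allowedByPos = allowedByPos;
        }

        @Override
        public boolean validSequence(int i, TokenTag[] inputSequence,
                                     String[] outcomesSequence, String outcome) {
            Set<String> allowed = allowedByPos.get(inputSequence[i].getPosTag());
            // POS tags missing from the map are left unconstrained.
            return allowed == null || allowed.contains(outcome);
        }
    }

    /** Builds a validator with made-up rules: tense labels are allowed only for verbs. */
    static SequenceValidator<TokenTag> exampleValidator() {
        Map<String, Set<String>> rules = new HashMap<>();
        rules.put("n", new HashSet<>(Arrays.asList("gender=M", "gender=F")));
        rules.put("v", new HashSet<>(Arrays.asList("tense=PRES", "tense=PAST")));
        return new PosAwareValidator(rules);
    }

    public static void main(String[] args) {
        TokenTag[] sentence = { new TokenTag("menino", "n"), new TokenTag("corre", "v") };
        SequenceValidator<TokenTag> validator = exampleValidator();
        // Check a tense label against the noun, then against the verb.
        System.out.println(validator.validSequence(0, sentence, new String[0], "tense=PRES"));
        System.out.println(validator.validSequence(1, sentence, new String[] {"gender=M"}, "tense=PRES"));
    }
}
```

Because the input sequence is typed as TokenTag[] rather than String[], the validator can read the POS tag at position i directly, which is what makes a rule such as "tense tags only for verbs" possible.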
For now I am only validating the features against a tag dictionary. The accuracy in a 10-fold cross-validation on the Brazilian corpus is now 97.142%. It should increase further if I modify the evaluator: if the Featurizer selects, for example, male as the gender of a token, but according to the corpus the token has two genders, the evaluator currently counts it as an error.

Thank you,
William

On Thu, Feb 2, 2012 at 2:13 AM, William Colen <[email protected]> wrote:

> Hi,
>
> I am trying to develop an OpenNLP based learnable featurizer. It can
> attach tags like gender, number, mood, person and verb tense. The input is
> the sentence tokens and the POS Tags.
> The context generator I am using is based on the one from Chunker, plus
> some prefix and suffix features.
>
> The current accuracy is 95,395%, but I think I can improve it using a
> sequence validator.
>
> Question:
> Is it possible to create a sequence validator that, besides the tokens,
> also knows the POS Tags? I would like to check if the combination POS Tag +
> features is OK (tense tags only for verbs for example).
>
> Thank you in advance. If it works, and you think it is a good tool, I will
> contribute the featurizer to OpenNLP.
>
> William
