On 8/12/11 12:53 PM, [email protected] wrote:
Should I iterate over the training data, or do it after model training? I
thought that not every tag would end up in the outcome list because of the
cutoff. It would also be difficult to predict which tags will be in the
outcome list while performing cross validation, because we train on a
subset of the corpus.

Well, you raise two points there. You could try the perceptron, which is usually trained without a cutoff. That still doesn't really help you with cross validation, though. Maybe you can add a little training data to your corpus, so that all tags are covered?
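If it helps, the trainer can be switched to the perceptron with a training parameters file (a sketch; the key names below are my recollection of the standard OpenNLP trainer parameters, so double-check them against your version):

```
# train-params.properties -- pass to the trainer, e.g. via -params on the CLI
Algorithm=PERCEPTRON
Iterations=100
Cutoff=0
```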

If you know which tags are causing trouble, you could simply remove every token that carries them from your dictionary. Removing a few words will not
make a big difference in accuracy anyway.
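A minimal sketch of that filtering step, assuming the dictionary is held as a plain word-to-tags map (the class and method names are illustrative, not OpenNLP API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TagDictFilter {

    // Drop every dictionary entry whose tag list contains one of the
    // troublesome tags; all other entries are kept unchanged.
    public static Map<String, List<String>> removeEntriesWithTags(
            Map<String, List<String>> dict, Set<String> badTags) {
        return dict.entrySet().stream()
                .filter(e -> e.getValue().stream().noneMatch(badTags::contains))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, List<String>> dict = new HashMap<>();
        dict.put("run", List.of("VB", "NN"));      // carries the bad tag NN
        dict.put("quickly", List.of("RB"));        // unaffected

        Map<String, List<String>> filtered =
                removeEntriesWithTags(dict, Set.of("NN"));
        System.out.println(filtered.keySet());     // only "quickly" survives
    }
}
```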

Sorry for not having a better answer.

Our current POS Tagger is purely statistical. To improve your situation we would need a hybrid approach, where it can fall back to rules when the statistical
decision is not plausible according to a tag dictionary or other rules.

We also had a user here who wanted to define short sequences in a tag dictionary to fix mistakes
he observed in the tagger's output.

Maybe both things could be done for 1.6. What do you think?

Jörn
