On 8/12/11 12:53 PM, [email protected] wrote:
Should I iterate over the training data or do it after model training? I
thought that not every tag would be in the outcome list because of the
cutoff. Also it would be difficult to predict which tags would be in the
outcome list while performing cross-validation, because we train with a
subset of the corpus.
Well, you raise two points there. You can try the perceptron, which is
usually trained without a cutoff. That doesn't really help you with
cross-validation, though.
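For reference, the perceptron can be selected through the training
parameters, e.g. loaded from a small properties file. This is just a
sketch; the exact parameter names and supported values depend on your
OpenNLP version, so check the manual for your release:

```
# Train a perceptron model instead of maxent/GIS, with no cutoff
Algorithm=PERCEPTRON
Iterations=100
Cutoff=0
```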
Maybe you can add a little training data to your corpus, so that all
tags are covered?
If you know which tags are causing trouble, you might just want to
remove all tokens that contain them from your dictionary. Removing a few
words will not make a big difference in accuracy anyway.
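That filtering step could be sketched like this, assuming the tag
dictionary is represented as a plain map from token to its allowed tags
(the names here are hypothetical, not the current OpenNLP API):

```java
import java.util.*;

public class TagDictFilter {

    // Drop every token whose allowed tags include one of the bad tags.
    static Map<String, Set<String>> removeTokensWithTags(
            Map<String, Set<String>> tagDict, Set<String> badTags) {
        Map<String, Set<String>> filtered = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : tagDict.entrySet()) {
            // Keep the entry only if none of its tags are bad.
            if (Collections.disjoint(e.getValue(), badTags)) {
                filtered.put(e.getKey(), e.getValue());
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> dict = new HashMap<>();
        dict.put("run", new HashSet<>(Arrays.asList("NN", "VB")));
        dict.put("the", new HashSet<>(Arrays.asList("DT")));

        // "run" is dropped because it carries the troublesome tag VB.
        Map<String, Set<String>> out =
                removeTokensWithTags(dict, Collections.singleton("VB"));
        System.out.println(out.keySet());
    }
}
```

The same loop works over a real POSDictionary as long as you can iterate
its entries and rebuild it from the surviving ones.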
Sorry for not having a better answer.
Our current POS tagger is completely statistical. To improve your
situation we would need a hybrid approach, where it can fall back to
rules in case the statistical decision is not plausible according to a
tag dictionary or other rules.
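The fallback could look roughly like this. All names and the dictionary
representation here are hypothetical, a sketch of the idea rather than
an existing OpenNLP interface: trust the statistical tag unless the tag
dictionary rules it out for that token, and then pick the best-scoring
tag the dictionary does allow.

```java
import java.util.*;

public class HybridTagger {

    // Choose a tag for one token from its statistical scores, falling
    // back to the tag dictionary when the top tag is not plausible.
    static String chooseTag(String token,
                            Map<String, Double> tagScores,
                            Map<String, Set<String>> tagDict) {
        String best = Collections.max(tagScores.entrySet(),
                Map.Entry.comparingByValue()).getKey();

        Set<String> allowed = tagDict.get(token);
        if (allowed == null || allowed.contains(best)) {
            return best; // statistical decision is plausible
        }

        // Fallback: best-scoring tag among the allowed ones.
        String fallback = null;
        double fallbackScore = Double.NEGATIVE_INFINITY;
        for (String tag : allowed) {
            double s = tagScores.getOrDefault(tag, Double.NEGATIVE_INFINITY);
            if (s > fallbackScore) {
                fallbackScore = s;
                fallback = tag;
            }
        }
        return fallback != null ? fallback : best;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("VB", 0.6);
        scores.put("NN", 0.4);

        Map<String, Set<String>> dict = new HashMap<>();
        dict.put("table", new HashSet<>(Arrays.asList("NN")));

        // The model prefers VB, but the dictionary only allows NN here.
        System.out.println(chooseTag("table", scores, dict));
    }
}
```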
We also had a user here who wanted to define short sequences in a tag
dictionary, to fix mistakes he observed in the output of the tagger.
Maybe both things could be done for 1.6. What do you think?
Jörn