On 8/12/11 12:53 PM, [email protected] wrote:
Should I iterate over the training data or do it after model training? I
thought that not every tag would be in the outcome list because of the
cutoff. Also it would be difficult to predict which tags would be in the
outcome list while performing cross-validation, because we train with a
subset of the corpus.
Well, you raise two points there. You can try the perceptron, which is
usually trained without a cutoff. That doesn't really help you with
cross-validation, though.
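For reference, the perceptron can be selected through the training
parameters, e.g. loaded from a small properties file. This is just a
sketch; the exact parameter names and supported values depend on your
OpenNLP version, so check the manual for your release:

```
# Train a perceptron model instead of maxent/GIS, with no cutoff
Algorithm=PERCEPTRON
Iterations=100
Cutoff=0
```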
Maybe you can add a little training data to your corpus, so that all
tags are covered?
If you know which tags are causing trouble, you might just want to
remove all tokens that contain them from your dictionary. Removing a few
words will not make a big difference in accuracy anyway.
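That filtering step could be sketched like this, assuming the tag
dictionary is represented as a plain map from token to its allowed tags
(the names here are hypothetical, not the current OpenNLP API):

```java
import java.util.*;

public class TagDictFilter {

    // Drop every token whose allowed tags include one of the bad tags.
    static Map<String, Set<String>> removeTokensWithTags(
            Map<String, Set<String>> tagDict, Set<String> badTags) {
        Map<String, Set<String>> filtered = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : tagDict.entrySet()) {
            // Keep the entry only if none of its tags are bad.
            if (Collections.disjoint(e.getValue(), badTags)) {
                filtered.put(e.getKey(), e.getValue());
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> dict = new HashMap<>();
        dict.put("run", new HashSet<>(Arrays.asList("NN", "VB")));
        dict.put("the", new HashSet<>(Arrays.asList("DT")));

        // "run" is dropped because it carries the troublesome tag VB.
        Map<String, Set<String>> out =
                removeTokensWithTags(dict, Collections.singleton("VB"));
        System.out.println(out.keySet());
    }
}
```

The same loop works over a real POSDictionary as long as you can iterate
its entries and rebuild it from the surviving ones.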
Sorry for not having a better answer.
Our current POS tagger is completely statistical. To improve your
situation we would need a hybrid approach, where it can fall back to
rules in case the statistical decision is not plausible according to a
tag dictionary or other rules.
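The fallback could look roughly like this. All names and the dictionary
representation here are hypothetical, a sketch of the idea rather than
an existing OpenNLP interface: trust the statistical tag unless the tag
dictionary rules it out for that token, and then pick the best-scoring
tag the dictionary does allow.

```java
import java.util.*;

public class HybridTagger {

    // Choose a tag for one token from its statistical scores, falling
    // back to the tag dictionary when the top tag is not plausible.
    static String chooseTag(String token,
                            Map<String, Double> tagScores,
                            Map<String, Set<String>> tagDict) {
        String best = Collections.max(tagScores.entrySet(),
                Map.Entry.comparingByValue()).getKey();

        Set<String> allowed = tagDict.get(token);
        if (allowed == null || allowed.contains(best)) {
            return best; // statistical decision is plausible
        }

        // Fallback: best-scoring tag among the allowed ones.
        String fallback = null;
        double fallbackScore = Double.NEGATIVE_INFINITY;
        for (String tag : allowed) {
            double s = tagScores.getOrDefault(tag, Double.NEGATIVE_INFINITY);
            if (s > fallbackScore) {
                fallbackScore = s;
                fallback = tag;
            }
        }
        return fallback != null ? fallback : best;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("VB", 0.6);
        scores.put("NN", 0.4);

        Map<String, Set<String>> dict = new HashMap<>();
        dict.put("table", new HashSet<>(Arrays.asList("NN")));

        // The model prefers VB, but the dictionary only allows NN here.
        System.out.println(chooseTag("table", scores, dict));
    }
}
```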
We also had a user here who wanted to define short sequences in a tag
dictionary, to fix mistakes he observed in the output of the tagger.
Maybe both things could be done for 1.6. What do you think?
Jörn