I see. I can't help you with this myself, but others on the list have the
knowledge to discuss these topics.

About feature selection, maybe you should refer to [1]. I remember
reading some discussion of it there, but I'm not 100% sure.

[1] Adwait Ratnaparkhi (1998). Maximum Entropy Models for Natural Language
Ambiguity Resolution. Ph.D. dissertation, University of Pennsylvania.
IRCS Technical Report IRCS-98-15.
http://www.ircs.upenn.edu/download/techreports/1998/98-15.pdf
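
For intuition, the model family discussed there is conditional maximum
entropy. Roughly (from memory, so please verify against the thesis):

    p(t | h) = exp( sum_j lambda_j * f_j(h, t) ) / Z(h)

where t is the tag, h is the surrounding context (the "history"), each
f_j(h, t) is a binary feature such as "current word is capitalized and
t = NNP", lambda_j is its learned weight, and Z(h) normalizes over all
tags. As far as I understand, a default feature is just a predicate that
is present in every context, so its weights act like a per-tag prior;
when it helps is then a feature-selection question of the kind the
thesis examines.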

Regards
William

On Fri, Mar 2, 2012 at 4:52 PM, Em <[email protected]> wrote:

> William,
>
> thank you for your advice.
>
> >> Some POS-taggers include default features (I think this is the best
> >> name for them out of all the ones I read), while others don't have
> >> them.
> I should have been clearer about that: I do not specifically mean the
> ones from OpenNLP but those of a lot of other taggers, too.
> And I am trying to understand the basic principles that indicate when
> having a normalization default-feature helps to improve tagging
> quality, instead of just guessing or doing trial and error.
> Knowing how the model is computed from a mathematical point of view
> *could* help me understand when to use normalization features; however,
> if people with more experience in these topics can explain it another
> way, I am happy with that, too :).
>
> You talked about another topic, too:
> Data quality.
> Are there any metrics that indicate whether you have good-quality data?
>
> For example, I tagged several thousand sentences of a specific wiki and
> found that I get a precision of around 90%+ and a recall of around
> 55-60%. There are several ways to tune these results: do more
> iterations on the training data, tune some other parameters, tag more
> sentences, etc. But what helps me prioritize my options?
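> (For reference, combining these into a single F1 score, F1 = 2*P*R /
> (P + R), with P = 0.90 and R = 0.575 gives about 0.70, so recall is
> clearly what drags the combined score down for me.)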
>
> Kind regards,
> Em
>
>
>
>
>
On 02.03.2012 20:35, [email protected] wrote:
> > Hi, Em,
> >
> > The OpenNLP default context generator is designed to be portable between
> > languages. You should try it and evaluate how your system performs. You
> > can also evaluate your model using the tools provided, for example:
> > bin/opennlp POSTaggerCrossValidator
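> >
> > (The exact flags differ between OpenNLP versions, so treat this as a
> > sketch; here sample.train is a hypothetical file in the POS training
> > format, one sentence per line with word_TAG tokens:
> >
> >   bin/opennlp POSTaggerCrossValidator -lang en -encoding UTF-8 \
> >     -data sample.train
> >
> > It trains and evaluates across the folds and prints the averaged
> > accuracy.)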
> >
> > There is no formula to decide whether you should include new features.
> > Compare the accuracy of other machine-learning POS tagger
> > implementations to yours. Researchers typically report models with
> > around 96.5% accuracy for English POS tagging, but it depends on
> > factors like the quality of the training data, the size of the
> > training data, etc.
> >
> > You can extend the default context generator to include features that
> > improve your model's effectiveness, based on characteristics of the
> > data you are working with.
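> >
> > As a minimal sketch of what that could look like (untested; class and
> > method names are from OpenNLP 1.5.x as I remember them, and the
> > "hasdigit" feature is just a made-up illustration, so double-check
> > against your version):
> >
> > import opennlp.tools.postag.DefaultPOSContextGenerator;
> > import opennlp.tools.postag.POSContextGenerator;
> >
> > public class NormalizingContextGenerator implements POSContextGenerator {
> >
> >   // Reuse the default features; null = no ngram dictionary.
> >   private final DefaultPOSContextGenerator delegate =
> >       new DefaultPOSContextGenerator(null);
> >
> >   public String[] getContext(int index, String[] tokens,
> >       String[] prevTags, Object[] additionalContext) {
> >     String[] base = delegate.getContext(index, tokens, prevTags,
> >         additionalContext);
> >     String[] context = new String[base.length + 1];
> >     System.arraycopy(base, 0, context, 0, base.length);
> >     // Extra (hypothetical) feature: does the token contain a digit?
> >     context[base.length] =
> >         tokens[index].matches(".*[0-9].*") ? "hasdigit" : "nodigit";
> >     return context;
> >   }
> > }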
> >
> > Regards
> > William
> >
> >
> >
> > On Fri, Mar 2, 2012 at 1:34 PM, Em <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I've read a little bit about POS-tagging and the theory behind it.
> >>
> >> Some POS-taggers include default features (I think this is the best
> >> name for them out of all the ones I read), while others don't have
> >> them.
> >>
> >> Do you explain somewhere when to include default features and when
> >> not?
> >>
> >> Is there a formula one can consult when deciding whether or not to
> >> include default features for normalization?
> >>
> >> Thank you.
> >>
> >> Em
> >>
> >
>
