I see. I can't help you with that, but others on the list have the knowledge to discuss these topics.
About feature selection, maybe you should refer to [1]. I remember reading some discussion there, but I am not 100% sure.

[1] Adwait Ratnaparkhi. (1998). Maximum entropy models for natural language ambiguity resolution. Ph.D. Dissertation, University of Pennsylvania. IRCS Tech Report IRCS-98-15. http://www.ircs.upenn.edu/download/techreports/1998/98-15.pdf

Regards
William

On Fri, Mar 2, 2012 at 4:52 PM, Em <[email protected]> wrote:

> William,
>
> thank you for your advice.
>
> >> In some POS-taggers there are default-features included (I think this is
> >> the best name for it from all the ones I read), while others didn't have
> >> them.
> I should have been more clear about that: I do not specifically mean
> those from OpenNLP but a lot of others, too.
> And I try to understand basic principles that indicate that having a
> normalization-default-feature helps to improve tagging-quality, instead
> of just doing trial and error or guessing.
> Knowing how the model might be computed from a mathematical point of
> view *could* help to understand when to use normalization-features,
> however if people with more experience at these topics could explain
> another way - I am happy with that, too :).
>
> You talked about another topic, too:
> Data quality.
> Are there any metrics that indicate that you have data of good quality?
>
> For example I tagged several thousand sentences of a specific wiki and
> found out that I have a precision of around 90%+ and a recall of around
> 55-60%. There are several ways to tune these results - do more
> iterations on the training-data, tune some other parameters, tag more
> sentences etc. - but what helps me to prioritize my options?
>
> Kind regards,
> Em
>
> On 02.03.2012 20:35, [email protected] wrote:
> > Hi, Em,
> >
> > The OpenNLP default context generator is designed to be portable between
> > languages. You should try it and evaluate how your system performs. You can
> > also evaluate your model using the tools provided, for example:
> > bin/opennlp POSTaggerCrossValidator
> >
> > There is no formula to decide if you should include new features. Compare
> > the accuracy of other machine learning POS Tagger implementations to yours.
> > Usually researchers can create models with 96.5% accuracy in English POS
> > Tagger, but it depends on factors like the quality of the training data,
> > size of the training data etc.
> >
> > You can extend the default context generator to include features that would
> > improve your model effectiveness, by checking some characteristics of the
> > data you are working with.
> >
> > Regards
> > William
> >
> > On Fri, Mar 2, 2012 at 1:34 PM, Em <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I've read a little bit about POS-tagging and the theory behind that.
> >>
> >> In some POS-taggers there are default-features included (I think this is
> >> the best name for it from all the ones I read), while others didn't have
> >> them.
> >>
> >> Are you explaining somewhere when to include default-features and when not?
> >>
> >> Is there a formula one can consult if one has to decide whether to
> >> include default-features for normalization and when not?
> >>
> >> Thank you.
> >>
> >> Em
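
On the precision and recall figures quoted above (around 90% and 55-60%), one way to weigh tuning options is to fold both numbers into a single F-measure and see which one moves it more. The following is only a minimal sketch in Python, not part of OpenNLP; the function name f_measure and the beta parameter are hypothetical names chosen for illustration.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    beta = 1.0 gives the usual F1 score. This helper is illustrative
    only and is not an OpenNLP API.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when both scores are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures quoted in the thread: precision around 0.90, recall 0.55 to 0.60.
for recall in (0.55, 0.60):
    print(f"P=0.90 R={recall:.2f} F1={f_measure(0.90, recall):.3f}")
```

With these inputs F1 comes out at roughly 0.68 and 0.72. Because the harmonic mean is pulled toward the lower value, even perfect precision at a recall of 0.55 would only give about 0.71, while lifting recall to 0.65 at unchanged precision gives about 0.75. Under that assumption, options that raise recall (for example tagging more sentences) look more promising than further precision tuning, though this is a rough guide and not a rule.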
