Thanks for the quick response. Some follow-up questions: Is it essential to annotate entities as the "misc" class too?
> It is usually best to train your own models for the domain data you
> want to annotate, otherwise the performance of the model suffers.

Isn't it hard to generate 15,000 accurately annotated sentences for every
domain I wish to recognise? (Just want to make sure that I am not missing
anything.)

Thank you

On Fri, Dec 19, 2014 at 1:41 PM, Rodrigo Agerri <[email protected]> wrote:
>
> Hi,
>
> On Fri, Dec 19, 2014 at 8:54 AM, Vihari Piratla <[email protected]> wrote:
> > Hello OpenNLP user community,
> > I read in the documentation that the training file should contain 15,000
> > sentences to achieve decent performance; can you explain or point me to
> > relevant documentation that explains this number?
>
> I do not know the origin of the 15,000-sentence assertion; perhaps it
> is because the CoNLL 2003 dataset for English contains that number of
> sentences. Note that the number of entities per class is also
> important: if your data are very sparse, it is difficult to learn. In
> the CoNLL training set there are around 24,000 entities across the
> four classes (person, org, loc and misc), of which:
>
> 3438 are misc
> 7140 are locations
> 6321 are organizations
> 6600 are persons.
>
> > Also, can you help me understand why the performance (especially
> > recall) is so bad with the OpenNLP person model in the OpenNLP Entity
> > Recogniser? What can I do to improve this?
>
> It all depends on which data you are annotating. It is usually best to
> train your own models for the domain data you want to annotate,
> otherwise the performance of the model suffers.
>
> Cheers,
>
> R

--
V
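(Not part of the original thread.) Since the per-class entity counts above are what make the CoNLL numbers meaningful, here is a minimal sketch of how one might check the same statistics in one's own training file before training. It assumes the data uses OpenNLP's name-finder annotation format, where entity spans look like `<START:person> Pierre Vinken <END>`; the sample sentences are made up for illustration.

```python
import re
from collections import Counter

def count_entities(lines):
    """Count annotated entities per class in OpenNLP name-finder
    training data, where spans look like <START:class> ... <END>.
    Useful for spotting sparse classes before training."""
    counts = Counter()
    for line in lines:
        # Each <START:...> tag opens exactly one entity span.
        counts.update(re.findall(r"<START:(\w+)>", line))
    return counts

# Hypothetical sample data; in practice read the lines from your training file.
sample = [
    "<START:person> Pierre Vinken <END> joined "
    "<START:organization> Elsevier <END> .",
    "He lives in <START:location> Amsterdam <END> .",
]

for cls, n in count_entities(sample).items():
    print(cls, n)
```

If one class's count is far below the others (as "misc" is in CoNLL), that class will likely be the hardest to learn, which bears on whether annotating it is worth the effort.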
