Hi,

On Fri, Dec 19, 2014 at 8:54 AM, Vihari Piratla <[email protected]> wrote:
> Hello OpenNLP user community,
> I read in the documentation that the training file should contain 15,000
> sentences to achieve a decent performance; Can you explain or point me to
> relevant documentation that explains this number.

I do not know the origin of the 15,000-sentence figure; perhaps it comes
from the CoNLL 2003 dataset for English, which contains roughly that number
of sentences. Note that the number of entities per class also matters: if
the data for a class is very sparse, that class is difficult to learn. In
the CoNLL training set there are around 23,500 entities across the 4
classes (person, org, loc and misc):

  misc           3438
  locations      7140
  organizations  6321
  persons        6600

> Also can you help me understand, why the performance (especially recall) is
> so bad with the OpenNLP person model with OpenNLP Entity Recogniser?
> What can I do to improve this?

It all depends on which data you are annotating. It is usually best to
train your own models on data from the domain you want to annotate;
otherwise the performance of the model suffers.

Cheers,

R
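
P.S. If it helps, here is a minimal sketch for checking how many entities
per class your own training file contains. It assumes the data is in the
OpenNLP name finder training format, where entities are marked as
<START:type> ... <END>. The class name EntityCounter and the sample
sentences are made up for illustration; this is not part of OpenNLP.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenNLP class.
class EntityCounter {

    // Matches opening tags such as <START:person> (assumed annotation format).
    private static final Pattern START_TAG = Pattern.compile("<START:([^>\\s]+)>");

    // Counts annotated entities per type across all sentences.
    static Map<String, Integer> countEntities(List<String> sentences) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            Matcher m = START_TAG.matcher(sentence);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative sentences only; replace with the lines of your training file.
        List<String> sample = List.of(
            "<START:person> Pierre Vinken <END> , 61 years old , joined <START:organization> Elsevier <END> .",
            "<START:person> Mr . Vinken <END> is chairman .");
        countEntities(sample).forEach((type, n) -> System.out.println(type + "\t" + n));
    }
}
```

If one class turns out to have far fewer annotations than the others, that
is probably where recall will suffer first.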
