Hi,

On Fri, Dec 19, 2014 at 8:54 AM, Vihari Piratla <[email protected]> wrote:
> Hello OpenNLP user community,
> I read in the documentation that the training file should contain 15,000
> sentences to achieve a decent performance; Can you explain or point me to
> relevant documentation that explains this number.

I do not know the origin of the 15,000 sentences figure; perhaps it
comes from the CoNLL 2003 dataset for English, which contains roughly
that number of sentences. Note that the number of entities per class is
also important: if the entities in your data are very sparse, it is
difficult to learn from them. In the CoNLL training set there are around
23,500 entities for the 4 classes (person, org, loc and misc), of which:

3438 are misc
7140 are locations
6321 are organizations
6600 are persons.
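
To check how sparse your own training data is, you can count sentences
and entities per class in the same way. Below is a minimal sketch, not an
official tool: it assumes the file is in the OpenNLP name finder training
format (one sentence per line, entities marked as <START:type> ... <END>,
blank lines between documents). The function name count_entities and the
sample sentences are only for illustration.

```python
import re
from collections import Counter

# Assumption: OpenNLP name finder training format, where each entity
# is wrapped as "<START:type> tokens <END>" and each line is one sentence.
START_TAG = re.compile(r"<START:([^>\s]+)>")


def count_entities(lines):
    """Return (number of non-empty sentences, Counter of entities per type)."""
    sentences = 0
    per_type = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines only separate documents
        sentences += 1
        # Every <START:type> tag opens exactly one entity of that type.
        per_type.update(START_TAG.findall(line))
    return sentences, per_type


# Illustrative sample, not real training data.
sample = [
    "<START:person> Pierre Vinken <END> , 61 years old , will join the board .",
    "",
    "Mr . <START:person> Vinken <END> is chairman of <START:organization> Elsevier N.V. <END> .",
]

n_sentences, counts = count_entities(sample)
print(n_sentences, dict(counts))
```

To run it on a real file, pass an open file handle instead of the sample
list, since the function only iterates over lines. If one class turns out
to have far fewer examples than the CoNLL figures above, that class is a
likely candidate for low recall.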

> Also can you help me understand, why the performance (especially recall) is
> so bad with the OpenNLP person model with OpenNLP Entity Recogniser?
> What can I do to improve this?

It all depends on which data you are annotating. It is usually best to
train your own models on data from the domain you want to annotate;
otherwise the performance of the model suffers.

Cheers,

R
