RE: Corpora used for training OpenNLP english models

Raj Kiran Mon, 10 Nov 2014 03:23:38 -0800

Thanks Jörn,

The corpus contains documents from a variety of sources and was released last 
year. It should fit our case; I will check this too.

Raj

-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: Monday, November 10, 2014 2:10 PM
To: [email protected]
Subject: Re: Corpora used for training OpenNLP english models

On 11/05/2014 08:14 AM, Rodrigo Agerri wrote:
> Hi Raj,
>
> I believe that the NameFinder models were trained with MUC, but I am
> not sure. In any case, if you are going to annotate a different domain
> to that of MUC, you will be better off annotating data for that domain
> because supervised approaches do not adapt well when used in other
> genres/domains.
>

The English name finder models are trained on MUC 6 / 7 plus some corrections 
to solve certain detection problems.

I suggest not using MUC anymore because it is quite dated.

If you want to train name finder models that perform well, I suggest having a 
look at OntoNotes 4.0. We have support for training OpenNLP models directly on it.

The data is not free; we had to pay around 50 USD to get it.

There is now also a newer version 5.0:
https://catalog.ldc.upenn.edu/LDC2013T19

I guess its format didn't change too much, so there is a good chance it 
runs with the 4.0 parsing code.

HTH,
Jörn
