Thanks Jörn,

The corpus contains documents from a variety of sources and was released last year. It should fit our case; I will check this too.

Raj

-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: Monday, November 10, 2014 2:10 PM
To: [email protected]
Subject: Re: Corpora used for training OpenNLP english models

On 11/05/2014 08:14 AM, Rodrigo Agerri wrote:
> Hi Raj,
>
> I believe that the NameFinder models were trained with MUC, but I am
> not sure. In any case, if you are going to annotate a different domain
> to that of MUC, you will be better off annotating data for that domain
> because supervised approaches do not adapt well when used in other
> genres/domains.
>

The English name finder models are trained on MUC 6 / 7 plus some corrections to solve certain detection problems. I suggest not using MUC anymore because it is quite dated.

If you want to train name finder models which perform well, I suggest having a look at OntoNotes 4.0. We have support to train OpenNLP models directly on it. The data is not free; we had to pay around 50 USD to get it.

There is now also a newer version 5.0:
https://catalog.ldc.upenn.edu/LDC2013T19

I guess its format didn't change too much, so there is a good chance it runs with the 4.0 parsing code.

HTH,
Jörn
