Re: Training new models

Jörn Kottmann Fri, 28 Mar 2014 02:05:09 -0700

On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:

Hi,




I am new to OpenNLP.  I've been playing with chunking, tokenizing, POS
tagging, and Name recognition for a few days.  I've been following the
example code and using preexisting models from
http://opennlp.sourceforge.net/models-1.5/.  I've been having some trouble
with name recognition and organization recognition in that using the above
mentioned models I can only identify common names or organizations like
"Mike Smith" and "IBM".  In addition I need to be able to find date ranges
and technical language like "Java", "C++", and "HTML" (I should mention
that my input is going to be resumes).



I figured I need to train my own models, especially since my training data
should look more like my input to give a better context (i.e. resumes).
I've been trying to find some information on how to do this in the
documentation and also doing google searches.  I found a few simple
examples, but not much more.  I did see the example in the documentation
with the "<START:person> <END>" tags and the command line to process the
training data into a .bin file, but nothing with organization names.  I
tried to look at one or two of the annotation guides and that created more
questions than answers (for example, the annotation guides are not consistent
with each other or the example in the documentation.  Are there pros and
cons between the different formats?  Are the examples in the documentation
in a native format?  Is there a conversion utility?  If so and I'm creating
data from scratch, would it not be better to just put it in the native
format?)



I just lack understanding of OpenNLP and NLP in general and the OpenNLP
Manual just hasn't worked for me.  Maybe I'm just misinterpreting the
documentation or just not looking in the right place.  I would appreciate
it greatly if someone could point me in the right direction in the way of
real world examples of training a model, recommending a book I can read
through, or maybe just some good examples of training data.  Beyond the
specific task I'm trying to accomplish, I would like to get a deeper
understanding of how OpenNLP works.


Hello,

the OpenNLP Name Finder training format is rather simple. As you already
figured out, you need to use the <START:entity_name> and <END> tags to mark
the names in tokenized plain text documents.

In the example above you could replace <START:person> with
<START:organization> to mark up an organization name in your text.
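
As a rough illustration of the tagging described above, here is a minimal
Python sketch that wraps token spans in <START:type> ... <END> tags. The
function name `annotate`, the span tuples, and the sample sentence are made up
for this example and are not part of OpenNLP; the sketch assumes the sentence
is already tokenized and that spans do not overlap.

```python
def annotate(tokens, spans):
    """Return one training line with <START:type> ... <END> tags.

    tokens: list of already-tokenized words for one sentence.
    spans:  list of (start, end, entity_type) token indices, end exclusive;
            spans are assumed not to overlap.
    """
    starts = {start: entity_type for start, end, entity_type in spans}
    ends = {end for start, end, entity_type in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close the entity that stops before token i
        if i in starts:
            out.append("<START:" + starts[i] + ">")
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # entity that runs to the end of the sentence
    return " ".join(out)


# Hypothetical resume sentence with one multi-token organization name.
tokens = "She worked at Acme Software Inc. for two years .".split()
line = annotate(tokens, [(3, 6, "organization")])
print(line)
# She worked at <START:organization> Acme Software Inc. <END> for two years .
```

Each such line would then go into the plain text training file in place of a
hand-tagged sentence.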

To create a model which performs well on your documents you will have to
label quite a few of them, and using a text editor to insert the tags is an
approach which does not scale beyond a few documents.

I suggest having a look at brat:
http://brat.nlplab.org/

Brat has a few issues in the 1.3 release version, but they are now resolved
in the trunk, so I recommend using the trunk instead of 1.3.

The OpenNLP Name Finder in the trunk version can be trained directly on the
brat format. If you want to use OpenNLP 1.5.3 instead, you can still use 1.6.0
to convert the data into the OpenNLP format discussed above.
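
To make the relationship between the two formats concrete, here is a small
hand-rolled Python sketch that turns brat standoff annotations (the "T" lines
of an .ann file, with character offsets into the .txt file) into tagged lines.
This is NOT the converter that ships with OpenNLP, only an illustration under
several assumptions: tokens in the text are separated by whitespace, entities
start and end on token boundaries, discontinuous spans are skipped, and the
entity type is lowercased by choice. The function names and the sample data
are hypothetical.

```python
import re


def parse_ann(ann_text):
    """Collect (start, end, type) character spans from brat 'T' lines."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes
        fields = line.split("\t")
        if len(fields) < 2 or ";" in fields[1]:
            continue  # skip malformed lines and discontinuous spans
        entity_type, start, end = fields[1].split(" ")
        spans.append((int(start), int(end), entity_type.lower()))
    return spans


def to_opennlp(text, spans):
    """Tag whitespace-separated tokens; one output line per input line."""
    lines = []
    offset = 0  # character offset of the current line within the document
    for raw in text.split("\n"):
        out = []
        open_end = None  # character end of the entity currently open
        for m in re.finditer(r"\S+", raw):
            tok_start, tok_end = offset + m.start(), offset + m.end()
            if open_end is None:
                for start, end, entity_type in spans:
                    if start == tok_start:
                        out.append("<START:" + entity_type + ">")
                        open_end = end
                        break
            out.append(m.group())
            if open_end is not None and tok_end >= open_end:
                out.append("<END>")
                open_end = None
        if open_end is not None:
            out.append("<END>")  # entity ran past the end of the line
        lines.append(" ".join(out))
        offset += len(raw) + 1  # +1 for the newline removed by split
    return "\n".join(lines)


# Hypothetical brat document: the text plus its standoff annotations.
text = "Mike Smith worked at IBM .\nHe knows Java ."
ann = (
    "T1\tPerson 0 10\tMike Smith\n"
    "T2\tOrganization 21 24\tIBM\n"
    "T3\tSkill 36 40\tJava\n"
)
converted = to_opennlp(text, parse_ann(ann))
print(converted)
```

For real data the conversion built into OpenNLP is the safer route, since it
is what the name finder training is expected to work with; the sketch is only
meant to show what the conversion amounts to.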

I know a few people who have done this successfully. Let us know if you have
any issues; a contribution about this process to our documentation would be
very welcome!

HTH,
Jörn
