On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:
Hi,
I am new to OpenNLP. I've been playing with chunking, tokenizing, POS
tagging, and Name recognition for a few days. I've been following the
example code and using preexisting models from
http://opennlp.sourceforge.net/models-1.5/. I've been having some trouble
with name recognition and organization recognition in that using the above
mentioned models I can only identify common names or organizations like
"Mike Smith" and "IBM". In addition I need to be able to find date ranges
and technical language like "Java", "C++", and "HTML" (I should mention
that my input is going to be resumes).
I figured I need to train my own models, especially since my training data
should look more like my input to give a better context (i.e. resumes).
I've been trying to find some information on how to do this in the
documentation and also doing Google searches. I found a few simple
examples, but not much more. I did see the example in the documentation
with the "<START:person> <END>" tags and the command line to process the
training data into a .bin file, but nothing with organization names. I
tried to look at one or two of the annotation guides and that created more
questions than answers (for example, the annotation guides are not consistent
with each other or with the example in the documentation. Are there pros and
cons between the different formats? Are the examples in the documentation
in a native format? Is there a conversion utility? If so and I'm creating
data from scratch, would it not be better to just put it in the native
format?)
I just lack understanding of OpenNLP and NLP in general and the OpenNLP
Manual just hasn't worked for me. Maybe I'm just misinterpreting the
documentation or just not looking in the right place. I would appreciate
it greatly if someone could point me in the right direction in the way of
real world examples of training a model, recommending a book I can read
through, or maybe just some good examples of training data. Beyond the
specific task I'm trying to accomplish, I would like to get a deeper
understanding of how OpenNLP works.
Hello,
The OpenNLP Name Finder training format is rather simple. As you already
figured out, you need to use the <START:entity_name> and <END> tags to
mark each name in tokenized plain-text documents.
In the example above you could replace <START:person> with
<START:organization> to mark up an organization name in your text.
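
As an illustration of that format, here is a minimal Java sketch that
writes one training line from tokens that are already split, plus a list
of labelled token spans. The class and method names (NameTrainingLine,
Entity, annotate) are made up for this example and are not part of the
OpenNLP API; the sample sentence is invented too. It only shows where the
tags and the surrounding spaces go.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not an OpenNLP class: builds one line of name finder
// training data from tokens and labelled token spans.
class NameTrainingLine {

    // Tokens [start, end) belong to an entity of the given type.
    static final class Entity {
        final int start;
        final int end;
        final String type;

        Entity(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Joins the tokens with single spaces and wraps each span in
    // <START:type> ... <END>, with the tags separated from tokens by spaces.
    static String annotate(String[] tokens, List<Entity> entities) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            for (Entity e : entities) {
                if (e.start == i) {
                    sb.append("<START:").append(e.type).append("> ");
                }
            }
            sb.append(tokens[i]);
            for (Entity e : entities) {
                if (e.end == i + 1) {
                    sb.append(" <END>");
                }
            }
            if (i < tokens.length - 1) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Worked", "at", "IBM", "with", "Mike", "Smith", "."};
        List<Entity> entities = Arrays.asList(
                new Entity(2, 3, "organization"),
                new Entity(4, 6, "person"));
        // Prints:
        // Worked at <START:organization> IBM <END> with <START:person> Mike Smith <END> .
        System.out.println(annotate(tokens, entities));
    }
}
```

Each sentence goes on its own line in the training file, and an empty
line separates one document from the next.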
To create a model which performs well on your documents you will have to
label quite a few of them, and using a text editor to insert the tags is
an approach which does not scale beyond a few documents.
I suggest having a look at brat:
http://brat.nlplab.org/
Brat has a few issues in the 1.3 release version, but they are now
resolved in the trunk, so I recommend using the trunk instead of 1.3.
The OpenNLP Name Finder in the trunk version can be trained directly on
the brat format.
If you want to use OpenNLP 1.5.3 instead, you can still use 1.6.0 to
convert the data into the OpenNLP format discussed above.
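
To give an idea of what such a conversion involves, here is a rough Java
sketch that turns brat standoff annotations into the tagged format. It is
not the OpenNLP converter, and the class name BratToNameSample is made up.
It assumes that the brat .txt text is already tokenized with single
spaces, that only text-bound annotations ("T" lines with a type, a start
offset and an end offset) matter, that spans do not overlap, and that the
brat type name lowercased is the entity name you want.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the OpenNLP converter: inserts <START:type> and
// <END> tags into text using character offsets from brat .ann lines.
class BratToNameSample {

    static String convert(String text, List<String> annLines) {
        List<String[]> spans = new ArrayList<>();
        for (String line : annLines) {
            if (!line.startsWith("T")) {
                continue; // only text-bound annotations are handled
            }
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                continue;
            }
            String[] fields = cols[1].split(" "); // type, start, end
            if (fields.length != 3) {
                continue; // discontinuous spans are skipped in this sketch
            }
            spans.add(fields);
        }
        // Insert from the end of the text so earlier offsets stay valid.
        spans.sort((a, b) -> Integer.parseInt(b[1]) - Integer.parseInt(a[1]));
        StringBuilder sb = new StringBuilder(text);
        for (String[] span : spans) {
            int start = Integer.parseInt(span[1]);
            int end = Integer.parseInt(span[2]);
            sb.insert(end, " <END>");
            sb.insert(start, "<START:" + span[0].toLowerCase(Locale.ROOT) + "> ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Worked at IBM with Mike Smith .";
        List<String> ann = Arrays.asList(
                "T1\tOrganization 10 13\tIBM",
                "T2\tPerson 19 29\tMike Smith");
        // Prints:
        // Worked at <START:organization> IBM <END> with <START:person> Mike Smith <END> .
        System.out.println(convert(text, ann));
    }
}
```

Real resumes will need proper sentence detection and tokenization before
this step, which is one more reason to rely on the existing tooling where
you can.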
I know a few people who have done this successfully. Let us know if you
have any issues, and a contribution about this process to our
documentation would be very welcome!
HTH,
Jörn