Is the training data used to train the tokenizer models available? 
Specifically, I'm interested in the data used to train the English tokenizer:

http://opennlp.sourceforge.net/models-1.5/en-token.bin

Thanks,
Stuart Robinson

> On Mar 29, 2014, at 10:12 AM, Sanjeev Sharma 
> <[email protected]> wrote:
> 
> Jorn,
> 
> Thank you for your reply.  Here is what I tried as a simple test:
> 
> - tagged the names on about 20 resumes using "<START:person><END>"
> notation
> - concatenated them into a single text file.
> - created a new .bin file using the following command
> 
>    >opennlp TokenNameFinderTrainer -model persons.bin -lang en -data
> train.txt -encoding UTF-8
> - using this model file and TokenNameFinderModel tried to identify a name
> in one of the resumes I used for training.  (I can post the code if you
> need.)
> 
> Should this work?  If not, what am I doing wrong?
> 
> Thanks,
> Sanjeev.
> 
> -----Original Message-----
> From: Jörn Kottmann [mailto:[email protected]]
> Sent: Friday, March 28, 2014 5:04 AM
> To: [email protected]
> Subject: Re: Training new models
> 
>> On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:
>> Hi,
>> 
>> 
>> 
>> I am new to OpenNLP.  I've been playing with chunking, tokenizing, POS
>> tagging, and Name recognition for a few days.  I've been following the
>> example code and using preexisting models from
>> http://opennlp.sourceforge.net/models-1.5/.  I've been having some
>> trouble with name recognition and organization recognition in that
>> using the above mentioned models I can only identify common names or
>> organizations like "Mike Smith" and "IBM".  In addition I need to be
>> able to find date ranges and technical language like "Java", "C++",
>> and "HTML" (I should mention that my input is going to be resumes).
>> 
>> 
>> 
>> I figured I need to train my own models, especially since my training
>> data should look more like my input to give a better context (i.e.
>> resumes).
>> I've been trying to find some information on how to do this in the
>> documentation and also doing Google searches.  I found a few simple
>> examples, but not much more.  I did see the example in the
>> documentation with the "<START:person> <END>" tags and the command
>> line to process the training data into a .bin file, but nothing with
>> organization names.  I tried to look at one or two of the annotation
>> guides and that created more questions than answers (for example, the
>> annotation guides are not consistent with each other or the example in the
>> documentation.  Are there pros and cons between the different formats?
>> Are the examples in the documentation in a native format?  Is there a
>> conversion utility?  If so and I'm creating data from scratch, would
>> it not be better to just put it in the native
>> format?)
>> 
>> 
>> 
>> I just lack understanding of OpenNLP and NLP in general and the
>> OpenNLP Manual just hasn't worked for me.  Maybe I'm just
>> misinterpreting the documentation or just not looking in the right
>> place.  I would appreciate it greatly if someone could point me in the
>> right direction in the way of real world examples of training a model,
>> recommending a book I can read through, or maybe just some good
>> examples of training data.  Beyond the specific task I'm trying to
>> accomplish, I would like to get a deeper understanding of how OpenNLP
>> works.
> 
> Hello,
> 
> The OpenNLP Name Finder training format is rather simple. As you already
> figured out, you need to use the <START:entity_name> and <END> tags to
> mark the names in tokenized plain text documents.
> 
> In the example above you could replace <START:person> with
> <START:organization> to mark up an organization name in your text.
> 
> To create a model which performs on your documents you will have to label
> quite a few of them and using a text editor to insert the tags is an
> approach which does not scale for more than a few documents.
> 
> I suggest having a look at brat:
> http://brat.nlplab.org/
> 
> Brat has a few issues in the 1.3 release version, but they are now
> resolved in the trunk, so I recommend using the trunk instead of 1.3.
> 
> The OpenNLP Name Finder in the trunk version can be directly trained on
> the brat format.
> If you want to use OpenNLP 1.5.3 instead, you can still use 1.6.0 to
> convert the data into the above discussed OpenNLP format.
> 
> I know a few people who have done this successfully. Let us know if you
> have any issues, and a contribution about this process to our documentation
> would be very welcome!
> 
> HTH,
> Jörn
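
For anyone hand-tagging data as discussed above, a quick sanity check of the <START:type> ... <END> markup may catch formatting mistakes before training. What follows is only a minimal sketch in plain Java, not OpenNLP's own parser. The class NameSampleCheck and its method parseLine are made up for illustration. It assumes one whitespace-tokenized sentence per line and accepts only the typed <START:type> form. It rejects tags glued to a neighbouring token (for example "<START:person>Mike"), on the assumption that tags and tokens must be separated by spaces, as in the examples in the OpenNLP manual.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for checking hand-tagged lines; not part of OpenNLP.
class NameSampleCheck {

    /** One annotated span: its entity type and the covered tokens joined by spaces. */
    static final class Entity {
        final String type;
        final String text;

        Entity(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /** Extracts the annotated spans from one line, or throws if the markup looks broken. */
    static List<Entity> parseLine(String line) {
        List<Entity> entities = new ArrayList<>();
        String openType = null;                 // type of the span currently open, if any
        List<String> span = new ArrayList<>();  // tokens collected for the open span

        for (String tok : line.trim().split("\\s+")) {
            if (tok.equals("<END>")) {
                if (openType == null) {
                    throw new IllegalArgumentException("<END> without <START>");
                }
                if (span.isEmpty()) {
                    throw new IllegalArgumentException("empty span for type " + openType);
                }
                entities.add(new Entity(openType, String.join(" ", span)));
                openType = null;
                span.clear();
            } else if (tok.matches("<START:[^<>]+>")) {
                if (openType != null) {
                    throw new IllegalArgumentException("nested start tag: " + tok);
                }
                openType = tok.substring("<START:".length(), tok.length() - 1);
            } else if (tok.contains("<START") || tok.contains("<END>")) {
                // Covers glued tags such as "<START:person>Mike" or "<START:person><END>".
                throw new IllegalArgumentException(
                        "malformed tag or tag not separated by spaces: " + tok);
            } else if (openType != null) {
                span.add(tok);
            }
            // Tokens outside any span are ignored by this check.
        }
        if (openType != null) {
            throw new IllegalArgumentException("unclosed <START:" + openType + ">");
        }
        return entities;
    }

    public static void main(String[] args) {
        String line = "<START:person> Mike Smith <END> joined "
                + "<START:organization> IBM <END> in 2010 .";
        for (Entity e : parseLine(line)) {
            System.out.println(e.type + ": " + e.text);
        }
        // prints:
        // person: Mike Smith
        // organization: IBM
    }
}
```

Running parseLine over every line of train.txt before calling TokenNameFinderTrainer would flag a line such as "<START:person>Mike Smith<END>", where the tags are not separate tokens.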
