Is the training data used to train the tokenizer models available? Specifically, I'm interested in the data used to train the English tokenizer:
http://opennlp.sourceforge.net/models-1.5/en-token.bin

Thanks,
Stuart Robinson

> On Mar 29, 2014, at 10:12 AM, Sanjeev Sharma <[email protected]> wrote:
>
> Jorn,
>
> Thank you for your reply. Here is what I tried as a simple test:
>
> - tagged the names on about 20 resumes using "<START:person><END>"
>   notation
> - concatenated them into a single text file.
> - created a new .bin file using the following command
>
> >opennlp TokenNameFinderTrainer -model persons.bin -lang en -data train.txt -encoding UTF-8
> - using this model file and TokenNameFinderModel tried to identify a
>   name in one of the resumes I used for training. (I can post the code
>   if you need.)
>
> Should this work? If not, what am I doing wrong?
>
> Thanks,
> Sanjeev.
>
> -----Original Message-----
> From: Jörn Kottmann [mailto:[email protected]]
> Sent: Friday, March 28, 2014 5:04 AM
> To: [email protected]
> Subject: Re: Training new models
>
>> On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:
>> Hi,
>>
>> I am new to OpenNLP. I've been playing with chunking, tokenizing, POS
>> tagging, and Name recognition for a few days. I've been following the
>> example code and using preexisting models from
>> http://opennlp.sourceforge.net/models-1.5/. I've been having some
>> trouble with name recognition and organization recognition in that
>> using the above mentioned models I can only identify common names or
>> organizations like "Mike Smith" and "IBM". In addition I need to be
>> able to find date ranges and technical language like "Java", "C++",
>> and "HTML" (I should mention that my input is going to be resumes).
>>
>> I figured I need to train my own models, especially since my training
>> data should look more like my input to give a better context (i.e.
>> resumes). I've been trying to find some information on how to do this
>> in the documentation and also doing Google searches. I found a few
>> simple examples, but not much more.
>> I did see the example in the
>> documentation with the "<START:person> <END>" tags and the command
>> line to process the training data into a .bin file, but nothing with
>> organization names. I tried to look at one or two of the annotation
>> guides and that created more questions than answers (for example, the
>> annotation guides are not consistent with each other or with the
>> example in the documentation. Are there pros and cons between the
>> different formats? Are the examples in the documentation in a native
>> format? Is there a conversion utility? If so and I'm creating data
>> from scratch, would it not be better to just put it in the native
>> format?)
>>
>> I just lack understanding of OpenNLP and NLP in general and the
>> OpenNLP Manual just hasn't worked for me. Maybe I'm just
>> misinterpreting the documentation or just not looking in the right
>> place. I would appreciate it greatly if someone could point me in the
>> right direction in the way of real world examples of training a model,
>> recommending a book I can read through, or maybe just some good
>> examples of training data. Beyond the specific task I'm trying to
>> accomplish, I would like to get a deeper understanding of how OpenNLP
>> works.
>
> Hello,
>
> The OpenNLP Name Finder training format is rather simple. As you
> already figured out, you need to use the <START:entity_name> and <END>
> tags to mark the names in tokenized plain text documents.
>
> In the example above you could replace <START:person> with
> <START:organization> to mark up an organization name in your text.
>
> To create a model which performs well on your documents you will have
> to label quite a few of them, and using a text editor to insert the
> tags is an approach which does not scale beyond a few documents.
>
> I suggest having a look at brat:
> http://brat.nlplab.org/
>
> Brat has a few issues in the 1.3 release version, but they are now
> resolved in the trunk, so I recommend using the trunk instead of 1.3.
>
> The OpenNLP Name Finder in the trunk version can be trained directly
> on the brat format.
> If you want to use OpenNLP 1.5.3 instead, you can still use 1.6.0 to
> convert the data into the OpenNLP format discussed above.
>
> I know a few people who have done this successfully. Let us know if
> you have any issues, and a contribution about this process to our
> documentation would be very welcome!
>
> HTH,
> Jörn
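
For anyone following the thread, here is a minimal sketch of how the tagged training lines discussed above could be generated programmatically instead of by hand in a text editor. It uses only the Java standard library; the class NameSampleFormatter, its Span type, and the annotate method are hypothetical names made up for this illustration and are not part of OpenNLP. It assumes the text is already tokenized and that the entity spans do not overlap, and it writes each tag as its own whitespace-separated token, which is how the <START:person> ... <END> examples in the OpenNLP documentation are laid out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not an OpenNLP class) that emits Name Finder training lines. */
final class NameSampleFormatter {

    /** Tokens in [start, end) form one entity of the given type, e.g. "person". */
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /**
     * Builds one training line for one tokenized sentence.
     * Assumes the spans do not overlap; tags and tokens are joined by single spaces.
     */
    static String annotate(String[] tokens, List<Span> spans) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // Open a tag in front of the first token of each span.
            for (Span s : spans) {
                if (s.start == i) {
                    out.add("<START:" + s.type + ">");
                }
            }
            out.add(tokens[i]);
            // Close the tag right after the last token of each span.
            for (Span s : spans) {
                if (s.end == i + 1) {
                    out.add("<END>");
                }
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String[] tokens = {"Mike", "Smith", "worked", "at", "IBM", "."};
        List<Span> spans = Arrays.asList(
                new Span(0, 2, "person"),
                new Span(4, 5, "organization"));
        // Prints: <START:person> Mike Smith <END> worked at <START:organization> IBM <END> .
        System.out.println(annotate(tokens, spans));
    }
}
```

Writing one such line per sentence into train.txt would give a file in the shape the TokenNameFinderTrainer command quoted above expects. Note that the "<START:person><END>" notation mentioned in the thread needs the name tokens between the two tags, with spaces around each tag.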
