Thanks, Sanjeev. I was actually asking about the data used to train the tokenizers provided by OpenNLP. I'll start a new thread to prevent confusion. Sorry about that.

On Sat, Mar 29, 2014 at 7:23 PM, Sanjeev Sharma <[email protected]> wrote:

> Sorry, can't share the data due to privacy concerns. The way I got this
> data was to extract text from Word doc resumes, cat them into a single text
> file, and tag only the names using <START:person> and <END> tags. I'm
> using 20 or so resumes for initial experimentation, but the actual training
> data will have several hundred resumes.
>
> -----Original Message-----
> From: Stuart Robinson [mailto:[email protected]]
> Sent: Saturday, March 29, 2014 8:01 PM
> To: [email protected]
> Subject: Re: Training new models
>
> Is the training data used to train the tokenizer models available?
> Specifically, I'm interested in the data used to train the English
> tokenizer:
>
> http://opennlp.sourceforge.net/models-1.5/en-token.bin
>
> Thanks,
> Stuart Robinson
>
> On Mar 29, 2014, at 10:12 AM, Sanjeev Sharma <[email protected]> wrote:
> >
> > Jorn,
> >
> > Thank you for your reply. Here is what I tried as a simple test:
> >
> > - tagged the names on about 20 resumes using "<START:person><END>"
> >   notation
> > - concatenated them into a single text file.
> > - created a new .bin file using the following command
> >
> >   >opennlp TokenNameFinderTrainer -model persons.bin -lang en -data train.txt -encoding UTF-8
> >
> > - using this model file and TokenNameFinderModel, tried to identify a
> >   name in one of the resumes I used for training. (I can post the code
> >   if you need.)
> >
> > Should this work? If not, what am I doing wrong?
> >
> > Thanks,
> > Sanjeev.
> >
> > -----Original Message-----
> > From: Jörn Kottmann [mailto:[email protected]]
> > Sent: Friday, March 28, 2014 5:04 AM
> > To: [email protected]
> > Subject: Re: Training new models
> >
> >> On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:
> >> Hi,
> >>
> >> I am new to OpenNLP. I've been playing with chunking, tokenizing,
> >> POS tagging, and name recognition for a few days. I've been
> >> following the example code and using preexisting models from
> >> http://opennlp.sourceforge.net/models-1.5/. I've been having some
> >> trouble with name recognition and organization recognition in that
> >> using the above mentioned models I can only identify common names or
> >> organizations like "Mike Smith" and "IBM". In addition I need to be
> >> able to find date ranges and technical language like "Java", "C++",
> >> and "HTML" (I should mention that my input is going to be resumes).
> >>
> >> I figured I need to train my own models, especially since my training
> >> data should look more like my input to give a better context (i.e.
> >> resumes). I've been trying to find some information on how to do this
> >> in the documentation and also doing Google searches. I found a few
> >> simple examples, but not much more. I did see the example in the
> >> documentation with the "<START:person> <END>" tags and the command
> >> line to process the training data into a .bin file, but nothing with
> >> organization names. I tried to look at one or two of the annotation
> >> guides and that created more questions than answers (for example, the
> >> annotation guides are not consistent with each other or with the
> >> example in the documentation. Are there pros and cons between the
> >> different formats? Are the examples in the documentation in a native
> >> format? Is there a conversion utility? If so and I'm creating data
> >> from scratch, would it not be better to just put it in the native
> >> format?)
> >>
> >> I just lack understanding of OpenNLP and NLP in general and the
> >> OpenNLP Manual just hasn't worked for me. Maybe I'm just
> >> misinterpreting the documentation or just not looking in the right
> >> place. I would appreciate it greatly if someone could point me in
> >> the right direction in the way of real world examples of training a
> >> model, recommending a book I can read through, or maybe just some
> >> good examples of training data. Beyond the specific task I'm trying
> >> to accomplish, I would like to get a deeper understanding of how
> >> OpenNLP works.
> >
> > Hello,
> >
> > the OpenNLP Name Finder training format is rather simple. As you
> > already figured out, you need to use the <START:entity_name> and <END>
> > tags to mark the name in tokenized plain text documents.
> >
> > In the example above you could replace <START:person> with
> > <START:organization> to mark up an organization name in your text.
> >
> > To create a model which performs on your documents you will have to
> > label quite a few of them, and using a text editor to insert the tags
> > is an approach which does not scale for more than a few documents.
> >
> > I suggest having a look at brat:
> > http://brat.nlplab.org/
> >
> > Brat has a few issues in the 1.3 release version, but they are now
> > resolved in the trunk, so I recommend using the trunk instead of 1.3.
> >
> > The OpenNLP Name Finder in the trunk version can be directly trained
> > on the brat format.
> > If you want to use OpenNLP 1.5.3 instead you can still use 1.6.0 to
> > convert the data into the above discussed OpenNLP format.
> >
> > I know a few people who have done this successfully. Let us know if
> > you have any issues, and a contribution about this process to our
> > documentation would be very welcome!
> >
> > HTH,
> > Jörn
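
For anyone skimming the quoted thread, here is a minimal sketch of the name finder training format it describes (<START:type> ... <END> tags around names in whitespace-tokenized text). It is standalone Python, not OpenNLP's own code, and the function name parse_annotated_line and the sample sentence are hypothetical. It assumes that tags must be separated from the surrounding tokens by whitespace, which is worth checking against the OpenNLP manual before relying on it.

```python
import re

# Matches a whole token of the form <START:type> (or a bare <START>).
START_TAG = re.compile(r"^<START(?::([^>\s]+))?>$")
END_TAG = "<END>"


def parse_annotated_line(line):
    """Split one whitespace-tokenized training line into tokens and spans.

    Returns (tokens, spans); each span is (start, end, entity_type),
    counted in tokens, with `end` exclusive.
    """
    tokens = []
    spans = []
    open_start = None  # token index where the currently open entity began
    open_type = None
    for piece in line.split():
        match = START_TAG.match(piece)
        if match:
            if open_start is not None:
                raise ValueError("nested <START> tag")
            open_start = len(tokens)
            # Assumption: an untyped <START> gets the type name "default".
            open_type = match.group(1) or "default"
        elif piece == END_TAG:
            if open_start is None:
                raise ValueError("<END> without a matching <START>")
            if open_start == len(tokens):
                raise ValueError("empty entity")
            spans.append((open_start, len(tokens), open_type))
            open_start = None
        elif "<START" in piece or END_TAG in piece:
            # A tag glued to a word, e.g. "<START:person>Mike".
            raise ValueError("tag not separated by whitespace: " + piece)
        else:
            tokens.append(piece)
    if open_start is not None:
        raise ValueError("unclosed <START> tag")
    return tokens, spans


sample = "<START:person> Mike Smith <END> worked at <START:organization> IBM <END> ."
tokens, spans = parse_annotated_line(sample)
for start, end, entity_type in spans:
    print(entity_type, " ".join(tokens[start:end]))
```

Running a check like this over a training file before calling TokenNameFinderTrainer is one way to catch tags written without surrounding spaces or left unclosed, which are easy mistakes to make when tagging by hand in a text editor.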
