Thank you Jorn.

-----Original Message-----
From: Joern Kottmann [mailto:[email protected]]
Sent: Sunday, March 30, 2014 12:54 PM
To: [email protected]
Subject: Re: Training new models

You should use a few hundred documents, maybe up to a bit over a thousand, to get good performance.
The model training command looks good. To get anything detected you will need more data.
I would also use the perceptron with a cutoff of zero instead of the default maxent with a cutoff of five.

HTH,
Jörn

On Sun, Mar 30, 2014 at 7:01 AM, Stuart Robinson <[email protected]> wrote:

> Thanks, Sanjeev. I was actually asking about the data used to train
> the tokenizers provided by OpenNLP. I'll start a new thread to prevent
> confusion. Sorry about that.
>
> On Sat, Mar 29, 2014 at 7:23 PM, Sanjeev Sharma <[email protected]> wrote:
>
> > Sorry, can't share the data due to privacy concerns. The way I got
> > this data was to extract text from Word doc resumes, cat them into a
> > single text file, and tag only the names using <START:person> and
> > <END> tags. I'm using 20 or so resumes for initial experimentation,
> > but the actual training data will have several hundred resumes.
> >
> > -----Original Message-----
> > From: Stuart Robinson [mailto:[email protected]]
> > Sent: Saturday, March 29, 2014 8:01 PM
> > To: [email protected]
> > Subject: Re: Training new models
> >
> > Is the training data used to train the tokenizer models available?
> > Specifically, I'm interested in the data used to train the English
> > tokenizer:
> >
> > http://opennlp.sourceforge.net/models-1.5/en-token.bin
> >
> > Thanks,
> > Stuart Robinson
> >
> > > On Mar 29, 2014, at 10:12 AM, Sanjeev Sharma <[email protected]> wrote:
> > >
> > > Jorn,
> > >
> > > Thank you for your reply. Here is what I tried as a simple test:
> > >
> > > - tagged the names on about 20 resumes using "<START:person><END>"
> > >   notation
> > > - concatenated them into a single text file.
> > > - created a new .bin file using the following command
> > >
> > >   >opennlp TokenNameFinderTrainer -model persons.bin -lang en -data train.txt -encoding UTF-8
> > >
> > > - using this model file and TokenNameFinderModel, tried to identify
> > >   a name in one of the resumes I used for training. (I can post the
> > >   code if you need.)
> > >
> > > Should this work? If not, what am I doing wrong?
> > >
> > > Thanks,
> > > Sanjeev.
> > >
> > > -----Original Message-----
> > > From: Jörn Kottmann [mailto:[email protected]]
> > > Sent: Friday, March 28, 2014 5:04 AM
> > > To: [email protected]
> > > Subject: Re: Training new models
> > >
> > >> On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:
> > >> Hi,
> > >>
> > >> I am new to OpenNLP. I've been playing with chunking,
> > >> tokenizing, POS tagging, and Name recognition for a few days.
> > >> I've been following the example code and using preexisting models
> > >> from http://opennlp.sourceforge.net/models-1.5/. I've been
> > >> having some trouble with name recognition and organization
> > >> recognition in that using the above mentioned models I can only
> > >> identify common names or organizations like "Mike Smith" and
> > >> "IBM". In addition I need to be able to find date ranges and
> > >> technical language like "Java", "C++", and "HTML" (I should mention
> > >> that my input is going to be resumes).
> > >>
> > >> I figured I need to train my own models, especially since my
> > >> training data should look more like my input to give a better
> > >> context (i.e. resumes).
> > >> I've been trying to find some information on how to do this in
> > >> the documentation and also doing google searches. I found a few
> > >> simple examples, but not much more. I did see the example in the
> > >> documentation with the "<START:person> <END>" tags and the
> > >> command line to process the training data into a .bin file, but
> > >> nothing with organization names.
> > >> I tried to look at one or two
> > >> of the annotation guides and that created more questions than
> > >> answers (for example, the annotation guides are not consistent with
> > >> each other or the example in the documentation. Are there pros
> > >> and cons between the different formats?
> > >> Are the examples in the documentation in a native format? Is
> > >> there a conversion utility? If so and I'm creating data from
> > >> scratch, would it not be better to just put it in the native
> > >> format?)
> > >>
> > >> I just lack understanding of OpenNLP and NLP in general and the
> > >> OpenNLP Manual just hasn't worked for me. Maybe I'm just
> > >> misinterpreting the documentation or just not looking in the
> > >> right place. I would appreciate it greatly if someone could
> > >> point me in the right direction in the way of real world examples
> > >> of training a model, recommending a book I can read through, or
> > >> maybe just some good examples of training data. Beyond the
> > >> specific task I'm trying to accomplish, I would like to get a
> > >> deeper understanding of how OpenNLP works.
> > >
> > > Hello,
> > >
> > > the OpenNLP Name Finder training format is rather simple. As you
> > > already figured out, you need to use the <START:entity_name> and
> > > <END> tags to mark the name in tokenized plain text documents.
> > >
> > > In the example above you could replace <START:person> with
> > > <START:organization> to mark up an organization name in your text.
> > >
> > > To create a model which performs on your documents you will have
> > > to label quite a few of them, and using a text editor to insert the
> > > tags is an approach which does not scale for more than a few
> > > documents.
> > >
> > > I suggest having a look at brat:
> > > http://brat.nlplab.org/
> > >
> > > Brat has a few issues in the 1.3 release version, but they are now
> > > resolved in the trunk. I recommend using the trunk instead of 1.3.
> > >
> > > The OpenNLP Name Finder in the trunk version can be directly
> > > trained on the brat format.
> > > If you want to use OpenNLP 1.5.3 instead, you can still use 1.6.0
> > > to convert the data into the OpenNLP format discussed above.
> > >
> > > I know a few people who have done this successfully. Let us know
> > > if you have any issues, and a contribution about this process to
> > > our documentation would be very welcome!
> > >
> > > HTH,
> > > Jörn
> > >
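
For anyone finding this thread later: the training format discussed above (whitespace-separated tokens with <START:entity_name> and <END> around each name) can be generated from token offsets instead of being typed by hand. The following is only a rough sketch of that idea in Python, not part of OpenNLP. The function name `tag_sentence`, the (start, end, type) span layout, and the "skill" entity type are all made up for illustration. It assumes the text is already tokenized and that spans are sorted and do not overlap. As far as I recall, the OpenNLP manual expects one sentence per line, so each returned string would be one line of train.txt.

```python
from typing import List, Tuple


def tag_sentence(tokens: List[str], spans: List[Tuple[int, int, str]]) -> str:
    """Build one training line with <START:type> ... <END> around each span.

    tokens: the tokens of one sentence, already tokenized.
    spans:  hypothetical (start, end, entity_type) token offsets, end exclusive,
            assumed to be sorted and non-overlapping.
    """
    starts = {start: etype for start, _, etype in spans}
    ends = {end for _, end, _ in spans}

    out = []
    for i, tok in enumerate(tokens):
        if i in ends:
            # Close the span that ended just before this token.
            out.append("<END>")
        if i in starts:
            # Open a new span in front of this token.
            out.append(f"<START:{starts[i]}>")
        out.append(tok)
    if len(tokens) in ends:
        # A span that runs to the end of the sentence still needs closing.
        out.append("<END>")
    return " ".join(out)


if __name__ == "__main__":
    tokens = ["Mike", "Smith", "worked", "at", "IBM", "."]
    spans = [(0, 2, "person"), (4, 5, "organization")]
    print(tag_sentence(tokens, spans))
    # prints: <START:person> Mike Smith <END> worked at <START:organization> IBM <END> .
```

The same function covers other entity types such as organizations by changing only the type string in the span, which matches the advice above about replacing <START:person> with <START:organization>.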
