RE: Training new models

Sanjeev Sharma Mon, 31 Mar 2014 13:36:32 -0700

Thank you Jorn.

-----Original Message-----
From: Joern Kottmann [mailto:[email protected]]
Sent: Sunday, March 30, 2014 12:54 PM
To: [email protected]
Subject: Re: Training new models


You should use a few hundred, maybe up to a bit over a thousand, to get good
performance.

The model training command looks good. To get anything detected you will
need more data. And I would use the perceptron with a cutoff of zero instead
of the default maxent with a cutoff of five.
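
A minimal sketch of the perceptron suggestion above, assuming the trainer
reads its settings from a training parameters file. The Java below only
writes such a file; the class name, the file name perceptron.params, and the
Iterations value of 100 are assumptions for illustration, not values from
this thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class PerceptronParams {

    // Lines of a training parameters file that select the perceptron
    // trainer with a cutoff of zero. Iterations=100 is an assumed value;
    // the thread does not specify an iteration count.
    static List<String> paramLines() {
        return Arrays.asList(
            "Algorithm=PERCEPTRON",
            "Iterations=100",
            "Cutoff=0");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default file name; pass another path as the first argument.
        Path out = Paths.get(args.length > 0 ? args[0] : "perceptron.params");
        Files.write(out, paramLines(), StandardCharsets.UTF_8);
        System.out.println("Wrote " + out);
    }
}
```

Assuming the command line tool accepts a -params argument, the file would be
passed alongside the original options, for example:
opennlp TokenNameFinderTrainer -params perceptron.params -model persons.bin
-lang en -data train.txt -encoding UTF-8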

HTH,
Jörn


On Sun, Mar 30, 2014 at 7:01 AM, Stuart Robinson <[email protected]> wrote:

> Thanks, Sanjeev. I was actually asking about the data used to train
> the tokenizers provided by OpenNLP. I'll start a new thread to prevent
> confusion. Sorry about that.
>
>
> On Sat, Mar 29, 2014 at 7:23 PM, Sanjeev Sharma <
> [email protected]> wrote:
>
> > Sorry, can't share the data due to privacy concerns.  The way I got
> > this data was to extract text from Word doc resumes, cat them into a
> > single text file, and tag only the names using <START:person> and
> > <END> tags.  I'm using 20 or so resumes for initial experimentation,
> > but the actual training data will have several hundred resumes.
> >
> > -----Original Message-----
> > From: Stuart Robinson [mailto:[email protected]]
> > Sent: Saturday, March 29, 2014 8:01 PM
> > To: [email protected]
> > Subject: Re: Training new models
> >
> > Is the training data used to train the tokenizer models available?
> > Specifically, I'm interested in the data used to train the English
> > tokenizer:
> >
> > http://opennlp.sourceforge.net/models-1.5/en-token.bin
> >
> > Thanks,
> > Stuart Robinson
> >
> > > On Mar 29, 2014, at 10:12 AM, Sanjeev Sharma
> > > <[email protected]> wrote:
> > >
> > > Jorn,
> > >
> > > > Thank you for your reply.  Here is what I tried as a simple test:
> > >
> > > - tagged the names on about 20 resumes using "<START:person><END>"
> > > notation
> > > - concatenated them into a single text file.
> > > - created a new .bin file using the following command
> > >
> > >    >opennlp TokenNameFinderTrainer -model persons.bin -lang en
> > > -data train.txt -encoding UTF-8
> > > - using this model file and TokenNameFinderModel tried to identify
> > > a name in one of the resumes I used for training.  (I can post the
> > > > code if you need.)
> > >
> > > Should this work?  If not, what am I doing wrong?
> > >
> > > Thanks,
> > > Sanjeev.
> > >
> > > -----Original Message-----
> > > From: Jörn Kottmann [mailto:[email protected]]
> > > Sent: Friday, March 28, 2014 5:04 AM
> > > To: [email protected]
> > > Subject: Re: Training new models
> > >
> > >> On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:
> > >> Hi,
> > >>
> > >>
> > >>
> > >> I am new to OpenNLP.  I've been playing with chunking,
> > >> tokenizing, POS tagging, and Name recognition for a few days.
> > >> I've been following the example code and using preexisting models
> > >> from http://opennlp.sourceforge.net/models-1.5/.  I've been
> > >> having some trouble with name recognition and organization
> > >> recognition in that using the above mentioned models I can only
> > >> identify common names or organizations like "Mike Smith" and
> > >> "IBM".  In addition I need to be able to find date ranges and
> > >> technical language like "Java", "C++", and "HTML" (I should mention
> > >> that my input is going to be resumes).
> > >>
> > >>
> > >>
> > > >> I figured I need to train my own models, especially since my
> > > >> training data should look more like my input to give a better
> > > >> context (i.e. resumes).
> > >> I've been trying to find some information on how to do this in
> > >> the documentation and also doing google searches.  I found a few
> > >> simple examples, but not much more.  I did see the example in the
> > >> documentation with the "<START:person> <END>" tags and the
> > >> command line to process the training data into a .bin file, but
> > >> nothing with organization names.  I tried to look at one or two
> > >> of the annotation guides and that created more questions than
> > > >> answers (for example, the annotation guides are not consistent
> > > >> with each other or the example in the documentation.  Are there
> > > >> pros and cons between the different formats?
> > >> Are the examples in the documentation in a native format?  Is
> > >> there a conversion utility?  If so and I'm creating data from
> > >> scratch, would it not be better to just put it in the native
> > >> format?)
> > >>
> > >>
> > >>
> > >> I just lack understanding of OpenNLP and NLP in general and the
> > >> OpenNLP Manual just hasn't worked for me.  Maybe I'm just
> > >> misinterpreting the documentation or just not looking in the
> > >> right place.  I would appreciate it greatly if someone could
> > >> point me in the right direction in the way of real world examples
> > >> of training a model, recommending a book I can read through, or
> > >> maybe just some good examples of training data.  Beyond the
> > >> specific task I'm trying to accomplish, I would like to get a
> > > >> deeper understanding of how OpenNLP works.
> > >
> > > Hello,
> > >
> > > > the OpenNLP Name Finder training format is rather simple: as you
> > > > already figured out, you need to use the <START:entity_name> and
> > > > <END> tags to mark the names in tokenized plain text documents.
> > > >
> > > > In the example above you could replace <START:person> with
> > > > <START:organization> to mark up an organization name in your text.
> > >
> > > To create a model which performs on your documents you will have
> > > to label quite a few of them and using a text editor to insert the
> > > tags is an approach which does not scale for more than a few
> > > documents.
> > >
> > > > I suggest having a look at brat:
> > > > http://brat.nlplab.org/
> > > >
> > > > Brat has a few issues in the 1.3 release version, but they are now
> > > > resolved in the trunk, so I recommend using the trunk instead of 1.3.
> > >
> > > The OpenNLP Name Finder in the trunk version can be directly
> > > trained on the brat format.
> > > > If you want to use OpenNLP 1.5.3 instead, you can still use 1.6.0
> > > > to convert the data into the OpenNLP format discussed above.
> > >
> > > > I know a few people who have done this successfully. Let us know
> > > > if you have any issues, and a contribution about this process to
> > > > our documentation would be very welcome!
> > >
> > > HTH,
> > > Jörn
> >
>
