Thanks, Sanjeev. I was actually asking about the data used to train the
tokenizers provided by OpenNLP. I'll start a new thread to prevent
confusion. Sorry about that.


On Sat, Mar 29, 2014 at 7:23 PM, Sanjeev Sharma <
[email protected]> wrote:

> Sorry, I can't share the data due to privacy concerns.  To build this
> data, I extracted text from Word doc resumes, concatenated it into a
> single text file, and tagged only the names using <START:person> and
> <END> tags.  I'm using 20 or so resumes for initial experimentation, but
> the actual training data will have several hundred resumes.
>
> -----Original Message-----
> From: Stuart Robinson [mailto:[email protected]]
> Sent: Saturday, March 29, 2014 8:01 PM
> To: [email protected]
> Subject: Re: Training new models
>
> Is the training data used to train the tokenizer models available?
> Specifically, I'm interested in the data used to train the English
> tokenizer:
>
> http://opennlp.sourceforge.net/models-1.5/en-token.bin
>
> Thanks,
> Stuart Robinson
>
> > On Mar 29, 2014, at 10:12 AM, Sanjeev Sharma
> > <[email protected]> wrote:
> >
> > Jorn,
> >
> > Thank you for your reply.  Here is what I tried as a simple test:
> >
> > - tagged the names on about 20 resumes using "<START:person><END>"
> >   notation
> > - concatenated them into a single text file.
> > - created a new .bin file using the following command
> >
> >    >opennlp TokenNameFinderTrainer -model persons.bin -lang en -data train.txt -encoding UTF-8
> >
> > - using this model file and TokenNameFinderModel tried to identify a
> >   name in one of the resumes I used for training.  (I can post the code
> >   if you need.)
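
One thing that may be worth checking before training: as far as I know, the name finder training format expects each tag to be separated from the surrounding tokens by whitespace, so a tag glued to a word is read as an ordinary token. The sketch below is plain Python, not part of OpenNLP. The function name find_glued_tags and the sample lines are made up for illustration.

```python
import re

# A tag counts as well-formed here only when it stands alone as a
# whitespace-separated piece (an assumption about the training format).
WHOLE_TAG = re.compile(r"^(<START:[^>\s]+>|<END>)$")
ANY_TAG = re.compile(r"<START:[^>\s]+>|<END>")


def find_glued_tags(lines):
    """Return (line_number, piece) for every piece where a tag touches other text."""
    problems = []
    for number, line in enumerate(lines, start=1):
        for piece in line.split():
            # Flag pieces that contain a tag but are not exactly one tag.
            if ANY_TAG.search(piece) and not WHOLE_TAG.match(piece):
                problems.append((number, piece))
    return problems


if __name__ == "__main__":
    # Hypothetical sample lines; the second one has tags glued to the name.
    sample = [
        "<START:person> Mike Smith <END> , Java developer",
        "<START:person>Mike Smith<END> , C++ developer",
    ]
    for number, piece in find_glued_tags(sample):
        print(f"line {number}: suspicious piece {piece!r}")
```

Running something like this over train.txt would only catch spacing problems. It says nothing about whether 20 resumes are enough data for a useful model.
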
> >
> > Should this work?  If not, what am I doing wrong?
> >
> > Thanks,
> > Sanjeev.
> >
> > -----Original Message-----
> > From: Jörn Kottmann [mailto:[email protected]]
> > Sent: Friday, March 28, 2014 5:04 AM
> > To: [email protected]
> > Subject: Re: Training new models
> >
> >> On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:
> >> Hi,
> >>
> >>
> >>
> >> I am new to OpenNLP.  I've been playing with chunking, tokenizing,
> >> POS tagging, and Name recognition for a few days.  I've been
> >> following the example code and using preexisting models from
> >> http://opennlp.sourceforge.net/models-1.5/.  I've been having some
> >> trouble with name recognition and organization recognition in that
> >> using the above mentioned models I can only identify common names or
> >> organizations like "Mike Smith" and "IBM".  In addition I need to be
> >> able to find date ranges and technical language like "Java", "C++",
> >> and "HTML" (I should mention that my input is going to be resumes).
> >>
> >>
> >>
> >> I figured I need to train my own models, especially since my training
> >> data should look more like my input to give a better context (i.e.
> >> resumes).
> >> I've been trying to find some information on how to do this in the
> >> documentation and also doing google searches.  I found a few simple
> >> examples, but not much more.  I did see the example in the
> >> documentation with the "<START:person> <END>" tags and the command
> >> line to process the training data into a .bin file, but nothing with
> >> organization names.  I tried to look at one or two of the annotation
> >> guides and that created more questions than answers (for example, the
> >> annotation guides are not consistent with each other or with the
> >> example in the documentation.  Are there pros and cons between the
> >> different formats?  Are the examples in the documentation in a native
> >> format?  Is there a conversion utility?  If so and I'm creating data
> >> from scratch, would it not be better to just put it in the native
> >> format?)
> >>
> >>
> >>
> >> I just lack understanding of OpenNLP and NLP in general and the
> >> OpenNLP Manual just hasn't worked for me.  Maybe I'm just
> >> misinterpreting the documentation or just not looking in the right
> >> place.  I would appreciate it greatly if someone could point me in
> >> the right direction in the way of real world examples of training a
> >> model, recommending a book I can read through, or maybe just some
> >> good examples of training data.  Beyond the specific task I'm trying
> >> to accomplish, I would like to get a deeper understanding of how
> >> OpenNLP works.
> >
> > Hello,
> >
> > the OpenNLP Name Finder training format is rather simple.  As you
> > already figured out, you need to use the <START:entity_name> and <END>
> > tags to mark the name in tokenized plain text documents.
> >
> > In the example above you could replace <START:person> with
> > <START:organization> to mark up an organization name in your text.
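
To make the format described above concrete, here is a minimal sketch in plain Python that reads one tokenized, annotated sentence and recovers the tokens plus the tagged spans. It is not OpenNLP's own parser. The function name parse_annotated_line and the sample sentence are assumptions for illustration, built from the "Mike Smith" and "IBM" examples earlier in the thread.

```python
import re

# Assumed tag shapes, taken from the <START:type> ... <END> notation in this thread.
START_TAG = re.compile(r"^<START:([^>\s]+)>$")
END_TAG = "<END>"


def parse_annotated_line(line):
    """Split one whitespace-tokenized sentence into tokens and name spans.

    Returns (tokens, spans), where each span is (start, end, type) and
    start/end are token indices with end exclusive.
    """
    tokens = []
    spans = []
    open_type = None   # entity type of the span currently open, if any
    open_start = None  # token index where that span began

    for piece in line.split():
        match = START_TAG.match(piece)
        if match:
            if open_type is not None:
                raise ValueError("nested <START:...> tag")
            open_type = match.group(1)
            open_start = len(tokens)
        elif piece == END_TAG:
            if open_type is None:
                raise ValueError("<END> without a matching <START:...>")
            spans.append((open_start, len(tokens), open_type))
            open_type = None
        else:
            tokens.append(piece)

    if open_type is not None:
        raise ValueError("<START:...> tag was never closed")
    return tokens, spans


if __name__ == "__main__":
    # Hypothetical sentence with one person and one organization.
    sample = ("<START:person> Mike Smith <END> joined "
              "<START:organization> IBM <END> in 2009 .")
    tokens, spans = parse_annotated_line(sample)
    print(tokens)
    print(spans)
```

The tags themselves never appear in the token list. They only mark which token ranges carry which entity type, which is why swapping person for organization is all it takes to label a different kind of name.
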
> >
> > To create a model which performs well on your documents you will have
> > to label quite a few of them, and using a text editor to insert the
> > tags is an approach which does not scale beyond a few documents.
> >
> > I suggest having a look at brat:
> > http://brat.nlplab.org/
> >
> > Brat has a few issues in the 1.3 release version, but they are now
> > resolved in the trunk, so I recommend using the trunk instead of 1.3.
> >
> > The OpenNLP Name Finder in the trunk version can be directly trained
> > on the brat format.
> > If you want to use OpenNLP 1.5.3 instead, you can still use 1.6.0 to
> > convert the data into the OpenNLP format discussed above.
> >
> > I know a few people who have done this successfully. Let us know if
> > you have any issues, and a contribution about this process to our
> > documentation would be very welcome!
> >
> > HTH,
> > Jörn
>
