No, you don't have to generate an example for every entity in the lists you
have.
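Sample around 15,000 sentences from text in the domain you are going to run
NER on, and annotate all the names that occur in them, if there are any.
The model learns from the context around a name as well as from the name
itself, which is most likely why it tagged "Damiano" in your test.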
If you want to make use of the list, you can supply it to OpenNLP as a
dictionary and do a dictionary lookup (see DictionaryNameFinder).
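
To make the dictionary lookup idea concrete, here is a minimal sketch in
plain Python. It is not the OpenNLP API; the function name find_names and
the (start, end) token-span convention are assumptions made only for this
illustration. It scans the tokens left to right and prefers the longest
dictionary entry at each position.

```python
def find_names(tokens, dictionary):
    """Return (start, end) token spans, end exclusive, for dictionary hits.

    Hypothetical helper for illustration only; not part of OpenNLP.
    """
    # Store each entry as a tuple of tokens so multi-word names can match.
    entries = {tuple(entry.split()) for entry in dictionary}
    longest = max((len(entry) for entry in entries), default=0)

    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate first so "Barack Obama" beats "Barack".
        for n in range(min(longest, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in entries:
                match_end = i + n
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end  # continue after the match, no overlapping spans
    return spans


if __name__ == "__main__":
    tokens = "My name is Barack Obama .".split()
    print(find_names(tokens, ["Barack", "Barack Obama", "Maria"]))
```

A pure lookup like this only finds names that are already in the list, so
it complements a trained model rather than replacing it.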

You can also use the list to bootstrap the training data. [This is a more
advanced approach; just ignore it if it is unclear.]
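
One possible way to bootstrap, sketched in plain Python: take real,
whitespace-tokenized sentences from your domain and pre-annotate every
name from the list with the <START:person> ... <END> tags, then review the
result by hand. The helper names build_name_pattern and annotate are
hypothetical and exist only in this sketch. It also assumes that names in
the list are written with single spaces, exactly as they appear in the
tokenized text.

```python
import re


def build_name_pattern(names):
    """Compile one regex that matches any listed name (hypothetical helper)."""
    if not names:
        raise ValueError("the name list is empty")
    # Longest names first so "Barack Obama" wins over "Barack".
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # A name must be delimited by whitespace or by the sentence boundary.
    return re.compile(r"(?<!\S)(?:" + alternation + r")(?!\S)")


def annotate(sentence, pattern, entity_type="person"):
    """Wrap every match in OpenNLP-style training tags."""
    return pattern.sub(
        lambda m: f"<START:{entity_type}> {m.group(0)} <END>", sentence
    )


if __name__ == "__main__":
    pattern = build_name_pattern(["Barack", "Barack Obama", "Maria"])
    for line in ["My name is Barack Obama .", "Her name is Maria ."]:
        print(annotate(line, pattern))
```

Names that are missing from the list stay unannotated, so the output is
only a starting point and still needs a manual pass before training.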

On Mon, Aug 17, 2015 at 5:22 PM, Damiano Porta <[email protected]>
wrote:

> Hello Vihari, thank you for your reply!
>
> Are you sure I should write out all the names/companies I have? I have
> trained this simple model:
>
> (en-ner-person.train)
> The name is <START:person> Barack Obama <END> .
> My name is <START:person> Barack Bla <END> .
> Her name is <START:person> Maria <END> .
> His name is <START:person> Barack <END> .
>
> Then I built it with:
>
> bash /home/damiano/lavoro/bin/opennlp-1.6.0/bin/opennlp
> TokenNameFinderTrainer -encoding UTF-8 -lang en -data
> /home/damiano/en-ner-person.train -model /home/damiano/en-ner-person.bin
>
> Then I launched the console with this model:
>
> bash /home/damiano/lavoro/bin/opennlp-1.6.0/bin/opennlp TokenNameFinder
> /home/damiano/en-ner-person.bin
>
> This is the output:
>
> name is Barack .
> name is <START:person> Barack <END> .
> name is Damiano
> name is <START:person> Damiano <END>
>
> As you can see, it detects "Damiano" as a person too, but I *never* used it
> in the training data. So the question is: do I really need to include all
> the name/surname combinations to create a model for persons?
>
> Same question for companies: as I wrote, I have 1M companies. Do I really
> need to write all of them into the training data?
>
> Thank you for the clarification!
>
> 2015-08-17 11:55 GMT+02:00 Vihari Piratla <[email protected]>:
>
> > Hi
> > I suggest using OpenNLP with the default models available here:
> > http://opennlp.sourceforge.net/models-1.5/
> > These models can recognize person, location (not address) and
> > organization names.
> > If they do not perform satisfactorily (which is most often the case),
> > train your own model as you described in point 2 of your mail.
> > Yes, creating the training data is very time consuming. The OpenNLP
> > documentation recommends at least 15,000 sentences of training data for
> > reasonable model performance.
> >
> > If you want to recognize only addresses and are not interested in
> > locations in general, I suggest you recognize entities of the three
> > types and then apply regular-expression-like pattern matching over the
> > tagged output. For example:
> > <Person Name>(\\W+)<Location>(\\W+)<NUMBER>(\\W+)<ZIPCODE> etc.
> >
> >
> > On Mon, Aug 17, 2015 at 2:55 AM, Damiano Porta <[email protected]>
> > wrote:
> >
> > > Hello everybody,
> > > I have just joined this mailing list! Thank you in advance for your
> > > help.
> > >
> > > I am working on a simple analyzer that extracts specific information
> > > from a text. The information I would like to extract is:
> > >
> > > 1. Person
> > > 2. Company
> > > 3. Email address
> > > 4. Zipcode
> > > 5. Home address
> > >
> > > For email addresses and zipcodes I use *RegexNameFinder* directly.
> > > Emails have a specific format, so a regex should work without problems,
> > > and so do zipcodes (5 digits long, only numbers). In this case
> > > RegexNameFinder works perfectly.
> > >
> > > The problems are with Person, Company and home addresses. I read the
> > > documentation for Named Entity Recognition but I have the following
> > > doubts:
> > >
> > > 1. I have a complete Italian name/surname database (CSV) and I would
> > > like to understand how to create the training data correctly. I see
> > > that I have to use a specific tag like <START:person> Person name here
> > > <END> in a context! As I wrote, I only have names and surnames (one per
> > > line), so in this case how can I create the model? Do I have to create
> > > fake sentences and put the names there?
> > >
> > > 2. Let's suppose we have those sentences. Do I have to write all the
> > > name/surname combinations to let OpenNLP understand when a token (or
> > > several tokens) is a Person? Example:
> > >
> > > <START:person> Barack <END> , 61 years old , will join the board as a
> > > nonexecutive director Nov. 29 .
> > >
> > > <START:person> Barack Obama <END> , 61 years old , will join the board
> > > as a nonexecutive director Nov. 29 .
> > >
> > > <START:person> Bill <END> , 61 years old , will join the board as a
> > > nonexecutive director Nov. 29 .
> > >
> > > <START:person> Bill Clinton <END> , 61 years old , will join the board
> > > as a nonexecutive director Nov. 29 .
> > >
> > > ...and so on?
> > >
> > > 3. Same doubt for companies: I have a very big database with around 1M
> > > company names. What is the best way to train OpenNLP for those names?
> > >
> > > Last but not least...
> > >
> > > 4. What is the best way to train OpenNLP for home addresses? In Italy,
> > > for example, the "format" is:
> > >
> > > Name Surname
> > > address, number, zip-code
> > > City
> > > Country
> > >
> > > Thank you so much!
> > >
> >
> >
> >
> > --
> > V
> >
>



-- 
V
