Here you can find the raw data I used to create a German model; maybe it's useful for you:

http://www.thomas-zastrow.de/nlp/

("Raw trainingdata in OpenNLP format")


On 22.04.2016 at 10:17, Robert Logue wrote:
Can anyone help here? I don't want to start creating a large training file and 
find out I have gone about it in the wrong way.

The resources I have been looking at are

https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html

None of which gives the answers I am looking for.

Thanks,

Robert

From: [email protected]
To: [email protected]
Subject: RE: Name finder questions
Date: Wed, 20 Apr 2016 09:51:25 +0100

I have a few questions regarding creating my own training data for the name 
finder. I would like to distinguish between people, organizations and 
locations. The example in the documentation shows the tags to use for people, i.e.

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .

So would I use <START:organization><END> and <START:location><END> for organizations 
and locations respectively? The named entity guidelines in the documentation, i.e.

https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides

seem to show different tags being used, which has confused me slightly as to 
which tags I should actually use.

Also, I see the 15,000 line recommendation. Is there any performance hit if you 
use many more lines?

If I create my plain text training file as I outlined above, are there any other 
params that are recommended beyond the basic ones, i.e.

opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data 
TRAINING_FILE.train -encoding UTF-8

For instance, what is the -params training parameters file used for? Is it 
necessary? Should it list the named entities I am looking for, i.e. person, 
organization and location? If so, what format should it be in?

Sorry for the basic questions here, but I can't find the answers in the 
documentation or from a quick Google search.

Thanks,

Robert


From: [email protected]
Date: Mon, 18 Apr 2016 09:36:24 +0200
Subject: Re: Name finder questions
To: [email protected]

Hello,

Yes, that is the idea.

R

On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue <[email protected]> wrote:
I am slightly confused about what I can use the data in those links for. Can I use 
this data with the training tool like the following?

opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en
-data DOWNLOADED_FILE_NAME -encoding UTF-8

And that should give me a better model file for when I use the name finder?

Thanks,

Robert

From: [email protected]
Date: Fri, 15 Apr 2016 17:12:20 +0200
Subject: Re: Name finder questions
To: [email protected]

Hi Robert,

On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue <[email protected]> wrote:
Hello,

I have just started using OpenNLP in a Java application. I am just getting 
used to the software and have a couple of newbie questions.

I see that for the name finder there are different models for people and 
organizations (en-ner-organization.bin and en-ner-person.bin). Is there any way 
to combine these into one file so I can do one search that will give me back 
both person names and organization names? Or is this not possible, and is it best to 
do two searches?

This used to be experimental. It is not anymore: you can train
a name finder model for more than one entity type. The models
available were trained with rather old newswire data, so I would
recommend that you train new models using OpenNLP:

http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
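
As a rough illustration of what multi-type training data might look like, the
sketch below checks a few annotated lines before they go to the trainer. It is
plain Python, not part of OpenNLP. The type names organization and location,
the helper name count_entity_types and the second sample sentence are
assumptions made up for this example. Only the Pierre Vinken line comes from
the documentation. The format assumed is the documented one: one tokenized
sentence per line, with entities wrapped in <START:type> ... <END>.

```python
import re
from collections import Counter

# Matches one annotated span: "<START:type> tokens <END>".
# Tags are separated from the tokens by whitespace, as in the documentation example.
SPAN = re.compile(r"<START:(\w+)>\s+(.+?)\s+<END>")


def count_entity_types(lines):
    """Count spans per entity type; raise ValueError on unbalanced tags.

    Hypothetical helper for sanity-checking a training file, not an OpenNLP API.
    """
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        spans = SPAN.findall(line)
        opens = line.count("<START:")
        closes = line.count("<END>")
        # Every opening tag needs exactly one closing tag and one matched span.
        if not (opens == closes == len(spans)):
            raise ValueError(f"line {number}: unbalanced or malformed tags")
        counts.update(entity_type for entity_type, _ in spans)
    return counts


# The second sentence is invented for illustration.
sample = [
    "<START:person> Pierre Vinken <END> , 61 years old , will join the board "
    "as a nonexecutive director Nov. 29 .",
    "<START:person> Jane Doe <END> works for <START:organization> Example Corp <END> "
    "in <START:location> Glasgow <END> .",
]

print(count_entity_types(sample))
```

A file that passes a check like this can then be given to the
TokenNameFinderTrainer command via -data. The type labels in the tags are the
ones the trained model reports back.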

I suppose you do not have manually annotated training data, so I would
recommend getting the OntoNotes corpus:

https://catalog.ldc.upenn.edu/LDC2013T19

https://github.com/ontonotes/conll-formatted-ontonotes-5.0

Another option is to get a silver standard corpus obtained
automatically from Wikipedia:

http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia

For Dutch, Spanish, German and Italian (that I know of) there are free
resources. Search for Ancora, SONAR-1, GermEval 2014 and Evalita 2009.

This question isn't related to the name finder and I don't think it is possible 
but thought I would ask anyway. If I had two sentences say 'Jack climbed the 
hill. He was very tired.' Is there any way to know that the pronoun, he, at the 
start of the second sentence is actually about Jack the subject of the first 
sentence? I know in this simple case it is obvious but I am wondering if there 
is anything in the OpenNLP software that will help with this?

The example you mentioned is called "pronominal anaphora", and it
generalizes to the coreference resolution problem. There used to be a
coreference tool in OpenNLP, but it was moved to the sandbox because many
things need to be updated before it can be distributed.

See http://conll.cemantix.org/2012/introduction.html for more details.

HTH,

R
                                        
                                        

--
Dr. Thomas Zastrow
Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
Gießenbachstr. 2, D-85748 Garching bei München, Germany
Tel +49-89-3299-1457
http://www.rzg.mpg.de
