RE: Name finder questions

Robert Logue Mon, 25 Apr 2016 06:11:35 -0700

The area I would be looking at is sports, and the only entity types I am
interested in are the three I mentioned, i.e.

People, Organizations and Locations

Do you think there are existing corpora that would cover this? Or would there be
any benefit in creating my own?

Thanks,
Robert

> From: [email protected]
> Date: Mon, 25 Apr 2016 09:39:48 +0200
> Subject: Re: Name finder questions
> To: [email protected]
> 
> Hi Robert,
> 
> Performance varies a lot, and that is still the subject of research.
> Basically, more data always helps, but depending on the type of data,
> number of entity types, etc., the quantity required differs. If you
> need to tag persons, locations and organizations in news or a similar
> text genre, I recommend that you use one of the existing corpora
> and avoid tagging your own data.
> 
> Which genre are you interested in?
> 
> R
> 
> On Fri, Apr 22, 2016 at 10:31 AM, Robert Logue <[email protected]> wrote:
> > Very useful, thank you.
> >
> > The only question I have left, for the moment, is on performance. The
> > minimum recommended number of sentences is 15,000. Does anyone know how
> > far this could be increased before it became a performance issue, if it
> > ever would? For example, would training data with 100,000 sentences be
> > an issue? Is there any number at which it would become one?
> >
> > Thanks,
> >
> > Robert
> >
> >> Subject: Re: Name finder questions
> >> To: [email protected]
> >> From: [email protected]
> >> Date: Fri, 22 Apr 2016 10:22:50 +0200
> >>
> >> Here you can find raw data I used to create a German model, maybe it's
> >> useful for you:
> >>
> >> http://www.thomas-zastrow.de/nlp/
> >>
> >> ("Raw trainingdata in OpenNLP format")
> >>
> >>
> >> On 22.04.2016 at 10:17, Robert Logue wrote:
> >> > Can anyone help here? I don't want to start creating a large training 
> >> > file and find out I have gone about it in the wrong way.
> >> >
> >> > The resources I have been looking at are
> >> >
> >> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
> >> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
> >> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
> >> >
> >> > None of which gives the answers I am looking for.
> >> >
> >> > Thanks,
> >> >
> >> > Robert
> >> >
> >> >> From: [email protected]
> >> >> To: [email protected]
> >> >> Subject: RE: Name finder questions
> >> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
> >> >>
> >> >> I have a few questions regarding creating my own training data for the 
> >> >> name finder. I would like to distinguish between people, organizations 
> >> >> and locations. The example in the documentation shows the tags to use
> >> >> for people, i.e.
> >> >>
> >> >> <START:person> Pierre Vinken <END> , 61 years old , will join the board
> >> >> as a nonexecutive director Nov. 29 .
> >> >>
> >> >> So would I use <START:organization><END> and <START:location><END> for
> >> >> organizations and locations respectively? The named entity guidelines
> >> >> in the documentation, i.e.
> >> >>
> >> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
> >> >>
> >> >> seem to show different tags being used, which has confused me slightly
> >> >> as to which tags I should actually use.
> >> >>
> >> >> Also, I see the 15,000-line recommendation. Is there any performance hit
> >> >> if you use many more lines?
> >> >>
> >> >> If I create my plain text training file as I outlined above, are there
> >> >> any other params that are recommended beyond the basic ones, i.e.
> >> >>
> >> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data 
> >> >> TRAINING_FILE.train -encoding UTF-8
> >> >>
> >> >> For instance, what is the -params training parameters file used for? Is
> >> >> it necessary? Should it list the named entities I am looking for, i.e.
> >> >> person, organization and location? If so, what format should it be in?
> >> >>
> >> >> Sorry for the basic questions here, but I can't find the answers in the
> >> >> documentation or from a quick Google search.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Robert
> >> >>
> >> >>
> >> >>> From: [email protected]
> >> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
> >> >>> Subject: Re: Name finder questions
> >> >>> To: [email protected]
> >> >>>
> >> >>> Hello,
> >> >>>
> >> >>> Yes, that is the idea.
> >> >>>
> >> >>> R
> >> >>>
> >> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue <[email protected]> 
> >> >>> wrote:
> >> >>>> I am slightly confused about what I can use the data in those links
> >> >>>> for. Can I use this data with the training tool like the following?
> >> >>>>
> >> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en
> >> >>>> -data DOWNLOADED_FILE_NAME -encoding UTF-8
> >> >>>>
> >> >>>> And should that give me a better model file for when I use the name
> >> >>>> finder?
> >> >>>>
> >> >>>> Thanks,
> >> >>>>
> >> >>>> Robert
> >> >>>>
> >> >>>>> From: [email protected]
> >> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
> >> >>>>> Subject: Re: Name finder questions
> >> >>>>> To: [email protected]
> >> >>>>>
> >> >>>>> Hi Robert,
> >> >>>>>
> >> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue 
> >> >>>>> <[email protected]> wrote:
> >> >>>>>> Hello,
> >> >>>>>>
> >> >>>>>> I have just started using OpenNLP in a Java application. I am
> >> >>>>>> just getting used to the software and have a couple of newbie
> >> >>>>>> questions.
> >> >>>>>>
> >> >>>>>> I see that for the name finder there are different models for people
> >> >>>>>> and organizations (en-ner-organization.bin and en-ner-person.bin).
> >> >>>>>> Is there any way to combine these into one file so I can do one
> >> >>>>>> search that will give me back person names and organization names?
> >> >>>>>> Or is this not possible, and is it best to do two searches?
> >> >>>>> This used to be experimental. It is not anymore: you can train
> >> >>>>> a name finder model for more than one entity type. The available
> >> >>>>> models were trained with rather old newswire data, so I would
> >> >>>>> recommend that you train new models using OpenNLP:
> >> >>>>>
> >> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
> >> >>>>>
> >> >>>>> I suppose you do not have manually annotated training data, so I would
> >> >>>>> recommend getting the OntoNotes corpus.
> >> >>>>>
> >> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
> >> >>>>>
> >> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
> >> >>>>>
> >> >>>>> Another option is to get a silver standard corpus obtained
> >> >>>>> automatically from Wikipedia:
> >> >>>>>
> >> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
> >> >>>>>
> >> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are
> >> >>>>> free resources. Search for Ancora, SONAR-1, GermEval 2014 and
> >> >>>>> Evalita 2009.
> >> >>>>>
> >> >>>>>> This question isn't related to the name finder and I don't think it 
> >> >>>>>> is possible but thought I would ask anyway. If I had two sentences 
> >> >>>>>> say 'Jack climbed the hill. He was very tired.' Is there any way to 
> >> >>>>>> know that the pronoun, he, at the start of the second sentence is 
> >> >>>>>> actually about Jack the subject of the first sentence? I know in 
> >> >>>>>> this simple case it is obvious but I am wondering if there is 
> >> >>>>>> anything in the OpenNLP software that will help with this?
> >> >>>>> The example you mentioned is called "pronominal anaphora" and it
> >> >>>>> generalizes to the coreference resolution problem. There used to be a
> >> >>>>> coreference tool in OpenNLP, but it got moved to the Sandbox because
> >> >>>>> many things need to be updated before it can be distributed.
> >> >>>>>
> >> >>>>> See http://conll.cemantix.org/2012/introduction.html for more details.
> >> >>>>>
> >> >>>>> HTH,
> >> >>>>>
> >> >>>>> R
> >> >>
> >> >
> >>
> >> --
> >> Dr. Thomas Zastrow
> >> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
> >> Gießenbachstr. 2, D-85748 Garching bei München, Germany
> >> Tel +49-89-3299-1457
> >> http://www.rzg.mpg.de
> >>
> >
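
On the annotation question raised earlier in the thread: as far as the manual describes it, the name finder training format is whitespace-tokenized, one sentence per line, and the text after START: names the entity type, so person, organization and location spans can share one training file. Before investing in a large hand-made file, a quick sanity check along the lines of the sketch below may help catch malformed tags. It is plain Java and does not call OpenNLP at all; the class name TrainingDataCheck, the regular expression and the sample sentence (the Pierre Vinken example, with Elsevier and Amsterdam added purely for illustration) are assumptions made here, not part of the OpenNLP API.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: counts annotated spans per type.
public class TrainingDataCheck {

    // One annotated span: "<START:type> token(s) <END>", whitespace-separated.
    private static final Pattern SPAN =
            Pattern.compile("<START:([^>\\s]+)>\\s+(.+?)\\s+<END>");

    /** Returns how many spans of each entity type one training sentence holds. */
    public static Map<String, Integer> countEntityTypes(String sentence) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by type name
        Matcher m = SPAN.matcher(sentence);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative sentence only; not taken from any real corpus file.
        String sentence = "<START:person> Pierre Vinken <END> , 61 years old , "
                + "will join <START:organization> Elsevier <END> in "
                + "<START:location> Amsterdam <END> .";
        // prints {location=1, organization=1, person=1}
        System.out.println(countEntityTypes(sentence));
    }
}
```

Running this over each line of a training file and summing the counts gives a rough picture of how many examples each entity type has. A tag written without surrounding spaces, such as <START:organization><END>, is not counted by this check, which is one way to spot spans that would probably not be read as intended.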
