RE: Name finder questions

Robert Logue Mon, 25 Apr 2016 07:33:12 -0700

I sure did, thanks. I was unsure whether these would work as well for sports 
specifically, or whether it would be best to make my own.


I may have missed something, but it is also unclear what the files are for, i.e. 
what is each one a model file for? The ones I downloaded and looked at seemed to 
be POS tagging rather than named entity tagging. Maybe my inexperience is making 
me miss something?
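
One possible reason the downloaded files look like POS tagging is that some
corpora put a POS tag and a named entity tag on every token. Below is a minimal
sketch of converting such a file to the <START:type> ... <END> markup. It
assumes one sentence per line, with tokens written as word|POS|tag and
IOB-style tags such as I-PER, I-ORG and I-LOC. That layout is an assumption,
so check it against the actual files. The function name to_opennlp and the
tag-to-type mapping are made up for illustration.

```python
# Assumed mapping from corpus tag labels to the name types wanted in the model.
TYPE_MAP = {"PER": "person", "ORG": "organization", "LOC": "location"}


def to_opennlp(line, type_map=TYPE_MAP):
    """Convert one 'word|POS|tag' sentence to <START:type> ... <END> markup."""
    out = []
    current = None  # name type of the span that is currently open, if any
    for item in line.split():
        # rsplit keeps any '|' inside the word itself intact
        word, _pos, tag = item.rsplit("|", 2)
        prefix, _, label = tag.partition("-")
        # 'O' and labels missing from type_map (e.g. MISC) count as outside
        name_type = type_map.get(label) if tag != "O" else None
        # close the open span when the type changes or a B- tag starts a new one
        if current is not None and (name_type != current or prefix == "B"):
            out.append("<END>")
            current = None
        if name_type is not None and current is None:
            out.append("<START:%s>" % name_type)
            current = name_type
        out.append(word)
    if current is not None:
        out.append("<END>")
    return " ".join(out)


if __name__ == "__main__":
    sample = "Pierre|NNP|I-PER Vinken|NNP|I-PER joined|VBD|O the|DT|O board|NN|O .|.|O"
    print(to_opennlp(sample))
```

Each converted line could then be written to the file passed to -data. The
POS column is simply dropped, since the name finder training format does not
use it.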

Thanks,
Robert



> From: [email protected]
> Date: Mon, 25 Apr 2016 15:43:23 +0200
> Subject: Re: Name finder questions
> To: [email protected]
> 
> Did you look at the links I sent in a previous email?
> 
> R
> 
> On Mon, Apr 25, 2016 at 3:10 PM, Robert Logue <[email protected]> wrote:
> > The area I would be looking at is sports, and the only things I would 
> > be interested in are the 3 things I mentioned, i.e.
> >
> > People, Organizations and Locations
> >
> > Do you think there are existing corpora that would cover this? Or would 
> > there be benefit in creating my own?
> >
> > Thanks,
> > Robert
> >
> >> From: [email protected]
> >> Date: Mon, 25 Apr 2016 09:39:48 +0200
> >> Subject: Re: Name finder questions
> >> To: [email protected]
> >>
> >> Hi Robert,
> >>
> >> Performance varies a lot, and that is still the subject of research.
> >> Basically, more data always helps, but the quantity required differs
> >> depending on the type of data, the number of entity types, etc. If you
> >> need to tag persons, locations and organizations in news or a similar
> >> text genre, I recommend that you use one of the existing corpora
> >> and avoid tagging your own data.
> >>
> >> Which genre are you interested in?
> >>
> >> R
> >>
> >> On Fri, Apr 22, 2016 at 10:31 AM, Robert Logue <[email protected]> 
> >> wrote:
> >> > Very useful, thank you.
> >> >
> >> > The only question I have left, for the moment, is on performance. The 
> >> > minimum recommended number of sentences is 15,000. Does anyone know how 
> >> > far this could be increased before it became a performance issue, if it 
> >> > ever would? For example, if I created training data with 100,000 
> >> > sentences, would this be an issue? Is there any number at which it 
> >> > would become one?
> >> >
> >> > Thanks,
> >> >
> >> > Robert
> >> >
> >> >> Subject: Re: Name finder questions
> >> >> To: [email protected]
> >> >> From: [email protected]
> >> >> Date: Fri, 22 Apr 2016 10:22:50 +0200
> >> >>
> >> >> Here you can find the raw data I used to create a German model; maybe
> >> >> it's useful for you:
> >> >>
> >> >> http://www.thomas-zastrow.de/nlp/
> >> >>
> >> >> ("Raw trainingdata in OpenNLP format")
> >> >>
> >> >>
> >> >> On 22.04.2016 at 10:17, Robert Logue wrote:
> >> >> > Can anyone help here? I don't want to start creating a large training 
> >> >> > file and find out I have gone about it in the wrong way.
> >> >> >
> >> >> > The resources I have been looking at are
> >> >> >
> >> >> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
> >> >> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
> >> >> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
> >> >> >
> >> >> > None of which gives the answers I am looking for.
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Robert
> >> >> >
> >> >> >> From: [email protected]
> >> >> >> To: [email protected]
> >> >> >> Subject: RE: Name finder questions
> >> >> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
> >> >> >>
> >> >> >> I have a few questions regarding creating my own training data for 
> >> >> >> the name finder. I would like to distinguish between people, 
> >> >> >> organizations and locations. The example in the documentation shows 
> >> >> >> the tags to use for people, i.e.
> >> >> >>
> >> >> >> <START:person> Pierre Vinken <END> , 61 years old , will join the 
> >> >> >> board as a nonexecutive director Nov. 29 .
> >> >> >>
> >> >> >> So would I use <START:organization><END> and <START:location><END> for 
> >> >> >> organizations and locations respectively? The named entity guidelines 
> >> >> >> in the documentation, i.e.
> >> >> >>
> >> >> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
> >> >> >>
> >> >> >> seem to show different tags being used, which has confused me 
> >> >> >> slightly as to which tags I should actually use.
> >> >> >>
> >> >> >> Also, I see the 15,000 line recommendation. Is there any performance 
> >> >> >> hit if you use many more lines?
> >> >> >>
> >> >> >> If I create my plain text training file as I outlined above, are there 
> >> >> >> any other parameters that are recommended beyond the basic ones, i.e.
> >> >> >>
> >> >> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data 
> >> >> >> TRAINING_FILE.train -encoding UTF-8
> >> >> >>
> >> >> >> For instance, what is the -params training parameters file used for? 
> >> >> >> Is it necessary? Should it list the named entities I am looking 
> >> >> >> for, i.e. person, organization and location? If so, what format 
> >> >> >> should it be in?
> >> >> >>
> >> >> >> Sorry for the basic questions, but I can't find the answers in the 
> >> >> >> documentation or from a quick Google search.
> >> >> >>
> >> >> >> Thanks,
> >> >> >>
> >> >> >> Robert
> >> >> >>
> >> >> >>
> >> >> >>> From: [email protected]
> >> >> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
> >> >> >>> Subject: Re: Name finder questions
> >> >> >>> To: [email protected]
> >> >> >>>
> >> >> >>> Hello,
> >> >> >>>
> >> >> >>> Yes, that is the idea.
> >> >> >>>
> >> >> >>> R
> >> >> >>>
> >> >> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue 
> >> >> >>> <[email protected]> wrote:
> >> >> >>>> I am slightly confused about what I can use the data in those links 
> >> >> >>>> for. Can I use this data with the training tool like the following?
> >> >> >>>>
> >> >> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en
> >> >> >>>> -data DOWNLOADED_FILE_NAME -encoding UTF-8
> >> >> >>>>
> >> >> >>>> And should that give me a better model file for when I use the 
> >> >> >>>> name finder?
> >> >> >>>>
> >> >> >>>> Thanks,
> >> >> >>>>
> >> >> >>>> Robert
> >> >> >>>>
> >> >> >>>>> From: [email protected]
> >> >> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
> >> >> >>>>> Subject: Re: Name finder questions
> >> >> >>>>> To: [email protected]
> >> >> >>>>>
> >> >> >>>>> Hi Robert,
> >> >> >>>>>
> >> >> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue 
> >> >> >>>>> <[email protected]> wrote:
> >> >> >>>>>> Hello,
> >> >> >>>>>>
> >> >> >>>>>> I have just started using OpenNLP in a Java application. I am 
> >> >> >>>>>> just getting used to the software and have a couple of 
> >> >> >>>>>> newbie questions.
> >> >> >>>>>>
> >> >> >>>>>> I see that for the name finder there is different model data for 
> >> >> >>>>>> people and organizations (en-ner-organization.bin and 
> >> >> >>>>>> en-ner-person.bin). Is there any way to combine these into one 
> >> >> >>>>>> file so I can do one search that will give me back both person 
> >> >> >>>>>> names and organization names? Or is this not possible, and is it 
> >> >> >>>>>> best to do two searches?
> >> >> >>>>> This used to be experimental. It is not anymore: you can train
> >> >> >>>>> a name finder model for more than one entity type. The available
> >> >> >>>>> models were trained with rather old newswire data, so I would
> >> >> >>>>> recommend that you train new models using OpenNLP:
> >> >> >>>>>
> >> >> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
> >> >> >>>>>
> >> >> >>>>> I suppose you do not have manually annotated training data, so I
> >> >> >>>>> could recommend getting the OntoNotes corpus.
> >> >> >>>>>
> >> >> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
> >> >> >>>>>
> >> >> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
> >> >> >>>>>
> >> >> >>>>> Another option is to get a silver standard corpus obtained
> >> >> >>>>> automatically from Wikipedia:
> >> >> >>>>>
> >> >> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
> >> >> >>>>>
> >> >> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are
> >> >> >>>>> free resources. Search for Ancora, SONAR-1, GermEval 2014 and
> >> >> >>>>> Evalita 2009.
> >> >> >>>>>
> >> >> >>>>>> This question isn't related to the name finder and I don't think 
> >> >> >>>>>> it is possible but thought I would ask anyway. If I had two 
> >> >> >>>>>> sentences say 'Jack climbed the hill. He was very tired.' Is 
> >> >> >>>>>> there any way to know that the pronoun, he, at the start of the 
> >> >> >>>>>> second sentence is actually about Jack the subject of the first 
> >> >> >>>>>> sentence? I know in this simple case it is obvious but I am 
> >> >> >>>>>> wondering if there is anything in the OpenNLP software that will 
> >> >> >>>>>> help with this?
> >> >> >>>>> The example you mentioned is called "pronominal anaphora" and it
> >> >> >>>>> generalizes to the coreference resolution problem. There used to
> >> >> >>>>> be a coreference tool in OpenNLP, but it was moved to the Sandbox
> >> >> >>>>> because many things need to be updated before it can be
> >> >> >>>>> distributed.
> >> >> >>>>>
> >> >> >>>>> See http://conll.cemantix.org/2012/introduction.html for more
> >> >> >>>>> details.
> >> >> >>>>>
> >> >> >>>>> HTH,
> >> >> >>>>>
> >> >> >>>>> R
> >> >> >>
> >> >> >
> >> >>
> >> >> --
> >> >> Dr. Thomas Zastrow
> >> >> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
> >> >> Gießenbachstr. 2, D-85748 Garching bei München, Germany
> >> >> Tel +49-89-3299-1457
> >> >> http://www.rzg.mpg.de
> >> >>
> >> >
> >
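
For the question quoted above about marking people, organizations and
locations in one training file, a quick sanity check before training might
look like the sketch below. It assumes whitespace-separated tokens with each
<START:type> and <END> tag standing as its own token, as in the Pierre Vinken
example. check_sentence is a hypothetical helper, not part of OpenNLP, and
the sample sentence is invented for illustration.

```python
import re
from collections import Counter

# A start tag must be a whole token of the form <START:type>
START = re.compile(r"^<START:([^>\s]+)>$")


def check_sentence(sentence):
    """Count name types in one training line; raise ValueError if malformed."""
    counts = Counter()
    open_type = None  # type of the span currently open, if any
    for token in sentence.split():
        match = START.match(token)
        if match:
            if open_type is not None:
                raise ValueError("<START:%s> was not closed" % open_type)
            open_type = match.group(1)
        elif token == "<END>":
            if open_type is None:
                raise ValueError("<END> without a matching <START:...>")
            counts[open_type] += 1
            open_type = None
        elif token.startswith("<START") or token.startswith("<END"):
            # catches tags glued to a neighbour, e.g. '<START:location><END>'
            raise ValueError("tag not separated by spaces: %r" % token)
    if open_type is not None:
        raise ValueError("<START:%s> was not closed" % open_type)
    return counts


if __name__ == "__main__":
    line = ("<START:person> Pierre Vinken <END> works for "
            "<START:organization> Elsevier <END> in "
            "<START:location> Amsterdam <END> .")
    print(check_sentence(line))
```

Summing the returned counters over a whole file gives a rough view of how
many examples of each type the training data contains.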
