Did you look at the links I sent in a previous email?

R

On Mon, Apr 25, 2016 at 3:10 PM, Robert Logue <[email protected]> wrote:
> The area I would be looking in would be sports and the only things I would be
> interested in would be the 3 things I mentioned, i.e.
>
> People, Organizations and Locations
>
> Do you think there are existing corpora that would cover this? Or would there
> be benefit in creating my own?
>
> Thanks,
> Robert
>
>> From: [email protected]
>> Date: Mon, 25 Apr 2016 09:39:48 +0200
>> Subject: Re: Name finder questions
>> To: [email protected]
>>
>> Hi Robert,
>>
>> Performance varies a lot, and that is still the subject of research.
>> Basically, more data always helps, but depending on the type of data,
>> number of entity types, etc., the quantity required differs. If you
>> need to tag persons, locations and organizations on news or a similar
>> text genre, I recommend that you use one of the already existing corpora
>> and avoid tagging your own data.
>>
>> Which genre are you interested in?
>>
>> R
>>
>> On Fri, Apr 22, 2016 at 10:31 AM, Robert Logue <[email protected]> wrote:
>> > Very useful, thank you.
>> >
>> > The only question I have left now, for the moment, is on performance. The
>> > minimum recommended number of sentences is 15,000. Does anyone know how
>> > much this would need to be increased before it became a performance
>> > issue, if it ever would? So if I created training data with 100,000
>> > sentences would this be an issue? Is there any number I could go to where
>> > it would be an issue?
>> >
>> > Thanks,
>> >
>> > Robert
>> >
>> >> Subject: Re: Name finder questions
>> >> To: [email protected]
>> >> From: [email protected]
>> >> Date: Fri, 22 Apr 2016 10:22:50 +0200
>> >>
>> >> Here you can find raw data I used to create a German model, maybe it's
>> >> useful for you:
>> >>
>> >> http://www.thomas-zastrow.de/nlp/
>> >>
>> >> ("Raw trainingdata in OpenNLP format")
>> >>
>> >>
>> >> On 22.04.2016 at 10:17, Robert Logue wrote:
>> >> > Can anyone help here?
>> >> > I don't want to start creating a large training
>> >> > file and find out I have gone about it in the wrong way.
>> >> >
>> >> > The resources I have been looking at are
>> >> >
>> >> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
>> >> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
>> >> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
>> >> >
>> >> > None of which gives the answers I am looking for.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Robert
>> >> >
>> >> >> From: [email protected]
>> >> >> To: [email protected]
>> >> >> Subject: RE: Name finder questions
>> >> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
>> >> >>
>> >> >> I have a few questions regarding creating my own training data for the
>> >> >> name finder. I would like to distinguish between people, organizations
>> >> >> and locations. The example in the documentation shows the tags to use
>> >> >> for people, i.e.
>> >> >>
>> >> >> <START:person> Pierre Vinken <END> , 61 years old , will join the
>> >> >> board as a nonexecutive director Nov. 29 .
>> >> >>
>> >> >> So would I use
>> >> >> <START:organization><END> and <START:location><END> for organizations
>> >> >> and locations respectively? The named entity guidelines in the
>> >> >> documentation, i.e.
>> >> >>
>> >> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
>> >> >>
>> >> >> seem to show different tags being used, which has confused me
>> >> >> slightly as to which tags I should actually use.
>> >> >>
>> >> >> Also, I see the 15,000 line recommendation. Is there any performance hit
>> >> >> if you use many more lines?
>> >> >>
>> >> >> If I create my plain text training file as I outlined above, are there
>> >> >> any other params that are recommended beyond the basic ones, i.e.
>> >> >>
>> >> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data
>> >> >> TRAINING_FILE.train -encoding UTF-8
>> >> >>
>> >> >> For instance, what is the -params training parameters file used for? Is
>> >> >> it necessary? Should it list the named entities I am looking for, i.e.
>> >> >> person, organization and location? If so, what format should it be in?
>> >> >>
>> >> >> Sorry for the basic questions here, but I can't find the answers in the
>> >> >> documentation or from a quick google.
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Robert
>> >> >>
>> >> >>
>> >> >>> From: [email protected]
>> >> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
>> >> >>> Subject: Re: Name finder questions
>> >> >>> To: [email protected]
>> >> >>>
>> >> >>> Hello,
>> >> >>>
>> >> >>> Yes, that is the idea.
>> >> >>>
>> >> >>> R
>> >> >>>
>> >> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue <[email protected]>
>> >> >>> wrote:
>> >> >>>> I am slightly confused about what I can use the data in those links
>> >> >>>> for. So can I use this data with the training tool like the following
>> >> >>>>
>> >> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en
>> >> >>>> -data DOWNLOADED_FILE_NAME -encoding UTF-8
>> >> >>>>
>> >> >>>> And that should give me a better model file for when I use the name
>> >> >>>> finder?
>> >> >>>>
>> >> >>>> Thanks,
>> >> >>>>
>> >> >>>> Robert
>> >> >>>>
>> >> >>>>> From: [email protected]
>> >> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
>> >> >>>>> Subject: Re: Name finder questions
>> >> >>>>> To: [email protected]
>> >> >>>>>
>> >> >>>>> Hi Robert,
>> >> >>>>>
>> >> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue
>> >> >>>>> <[email protected]> wrote:
>> >> >>>>>> Hello,
>> >> >>>>>>
>> >> >>>>>> I have just started using OpenNLP in a Java application.
>> >> >>>>>> I am just getting used to the software and have a couple of newbie
>> >> >>>>>> questions.
>> >> >>>>>>
>> >> >>>>>> I see for the name finder there is different model data for people
>> >> >>>>>> and organizations (en-ner-organization.bin and en-ner-person.bin).
>> >> >>>>>> Is there any way to combine these into one file so I can do 1
>> >> >>>>>> search that will give me back person names and organization names?
>> >> >>>>>> Or is this not possible and is it best to do two searches?
>> >> >>>>> This used to be experimental. It is not anymore: you can train
>> >> >>>>> a name finder model for more than one entity type. The models
>> >> >>>>> available were trained with rather old newswire data, so I would
>> >> >>>>> recommend that you train new models using OpenNLP:
>> >> >>>>>
>> >> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
>> >> >>>>>
>> >> >>>>> I suppose you do not have manually annotated training data, so I
>> >> >>>>> would recommend getting the OntoNotes corpus.
>> >> >>>>>
>> >> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
>> >> >>>>>
>> >> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
>> >> >>>>>
>> >> >>>>> Another option is to get a silver standard corpus obtained
>> >> >>>>> automatically from Wikipedia:
>> >> >>>>>
>> >> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
>> >> >>>>>
>> >> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are
>> >> >>>>> free resources. Search for Ancora, SONAR-1, GermEval 2014 and Evalita
>> >> >>>>> 2009.
>> >> >>>>>
>> >> >>>>>> This question isn't related to the name finder and I don't think
>> >> >>>>>> it is possible but thought I would ask anyway. If I had two
>> >> >>>>>> sentences, say 'Jack climbed the hill. He was very tired.'
>> >> >>>>>> Is there any way to know that the pronoun, he, at the start of the
>> >> >>>>>> second sentence is actually about Jack, the subject of the first
>> >> >>>>>> sentence? I know in this simple case it is obvious but I am wondering
>> >> >>>>>> if there is anything in the OpenNLP software that will help with this?
>> >> >>>>> The example you mentioned is called "pronominal anaphora" and it
>> >> >>>>> generalizes to the coreference resolution problem. There used to be
>> >> >>>>> a coreference tool in OpenNLP but it got moved to the Sandbox because
>> >> >>>>> many things need to be updated to be able to distribute it.
>> >> >>>>>
>> >> >>>>> See http://conll.cemantix.org/2012/introduction.html for more
>> >> >>>>> details.
>> >> >>>>>
>> >> >>>>> HTH,
>> >> >>>>>
>> >> >>>>> R
>> >> >>
>> >> >
>> >>
>> >> --
>> >> Dr. Thomas Zastrow
>> >> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
>> >> Gießenbachstr. 2, D-85748 Garching bei München, Germany
>> >> Tel +49-89-3299-1457
>> >> http://www.rzg.mpg.de
>> >>
>> >
>
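
For anyone landing on this thread with the same tagging question, here is a rough sketch of what a training line with several entity types could look like. It assumes that organizations and locations use the same <START:type> ... <END> pattern as the person example quoted above, with only the type name changed; that matches the remark in the thread that one model can be trained for more than one entity type, but check it against the manual for your OpenNLP version. The helper `annotate` and the second sentence are made up for illustration and are not part of OpenNLP.

```python
def annotate(tokens, spans):
    """Render one tokenized sentence in the <START:type> ... <END> format.

    tokens: list of token strings for a single sentence.
    spans:  list of (start, end, entity_type) token indices, end exclusive.
            Spans are assumed not to overlap (hypothetical helper, not OpenNLP API).
    """
    starts = {start: etype for start, _end, etype in spans}
    ends = {end for _start, end, _etype in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close the span that stops before this token
        if i in starts:
            out.append("<START:" + starts[i] + ">")
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # close a span that runs to the end of the sentence
    return " ".join(out)


# Sentence from the documentation example quoted in the thread.
doc_tokens = ("Pierre Vinken , 61 years old , will join the board "
              "as a nonexecutive director Nov. 29 .").split()
doc_line = annotate(doc_tokens, [(0, 2, "person")])

# Invented sports-style sentence, purely illustrative.
toy_tokens = "Jane Doe joined Example United in Glasgow .".split()
toy_line = annotate(
    toy_tokens,
    [(0, 2, "person"), (3, 5, "organization"), (6, 7, "location")],
)

print(doc_line)
print(toy_line)
```

Each returned string is meant to be one line of the plain text file passed to -data in the TokenNameFinderTrainer commands quoted above, one sentence per line.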
