The area I am interested in is sports, and the only entity types I need
are the three I mentioned, i.e. people, organizations and locations.

Do you think there are existing corpora that would cover this? Or would
there be a benefit in creating my own?

Thanks,

Robert

> From: [email protected]
> Date: Mon, 25 Apr 2016 09:39:48 +0200
> Subject: Re: Name finder questions
> To: [email protected]
>
> Hi Robert,
>
> Performance varies a lot, and that is still the subject of research.
> Basically, more data always helps, but depending on the type of data,
> number of entity types, etc., the quantity required differs. If you
> need to tag persons, locations and organizations on news or similar
> text genre I recommend you to use one of the already existing corpora
> and avoid tagging your own data.
>
> Which genre are you interested in?
>
> R
>
> On Fri, Apr 22, 2016 at 10:31 AM, Robert Logue <[email protected]> wrote:
> > Very useful, thank you.
> >
> > Only question I have left now, for the moment, is on performance. The
> > minimum recommended number of sentences is 15,000. Does anyone know how
> > much this would need to be increased before it would, maybe it never
> > would, become a performance issue? So if I created training data with
> > 100,000 sentences would this be an issue? Is there any number I could go
> > to where it would be an issue?
> >
> > Thanks,
> >
> > Robert
> >
> >> Subject: Re: Name finder questions
> >> To: [email protected]
> >> From: [email protected]
> >> Date: Fri, 22 Apr 2016 10:22:50 +0200
> >>
> >> Here you can find raw data I used to create a German model, maybe it's
> >> useful for you:
> >>
> >> http://www.thomas-zastrow.de/nlp/
> >>
> >> ("Raw trainingdata in OpenNLP format")
> >>
> >>
> >> On 22.04.2016 at 10:17, Robert Logue wrote:
> >> > Can anyone help here? I don't want to start creating a large training
> >> > file and find out I have gone about it in the wrong way.
> >> >
> >> > The resources I have been looking at are
> >> >
> >> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
> >> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
> >> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
> >> >
> >> > None of which gives the answers I am looking for.
> >> >
> >> > Thanks,
> >> >
> >> > Robert
> >> >
> >> >> From: [email protected]
> >> >> To: [email protected]
> >> >> Subject: RE: Name finder questions
> >> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
> >> >>
> >> >> I have a few questions regarding creating my own training data for the
> >> >> name finder. I would like to distinguish between people, organizations
> >> >> and locations. The example in the documentation shows the tags to use
> >> >> for people ie
> >> >>
> >> >> <START:person> Pierre Vinken <END> , 61 years old , will join the board
> >> >> as a nonexecutive director Nov. 29 .
> >> >>
> >> >> So would I use <START:organization><END> and <START:location><END> for
> >> >> organizations and locations respectively? The named entity guidelines
> >> >> in the documentation ie
> >> >>
> >> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
> >> >>
> >> >> seem to show different tags getting used which has confused me slightly
> >> >> as to which tags I should actually use?
> >> >>
> >> >> Also I see the 15,000 line recommendation. Is there any performance hit
> >> >> if you use many more lines?
> >> >>
> >> >> If I create my plain text training file as I outlined above are there
> >> >> any other params that are recommended to use beyond the basic ie
> >> >>
> >> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data
> >> >> TRAINING_FILE.train -encoding UTF-8
> >> >>
> >> >> For instance what is the -params training parameters file used for?
> >> >> Is this necessary? Should this list the named entities I am looking
> >> >> for ie person, organization and location? If so what format should it
> >> >> be in?
> >> >>
> >> >> Sorry for the basic questions here but I can't find the answers in the
> >> >> documentation or from a quick google.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Robert
> >> >>
> >> >>
> >> >>> From: [email protected]
> >> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
> >> >>> Subject: Re: Name finder questions
> >> >>> To: [email protected]
> >> >>>
> >> >>> Hello,
> >> >>>
> >> >>> Yes, that is the idea.
> >> >>>
> >> >>> R
> >> >>>
> >> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue <[email protected]>
> >> >>> wrote:
> >> >>>> I am slightly confused about what I can use the data in those links
> >> >>>> for. So can I use this data with the training tool like the following
> >> >>>>
> >> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en
> >> >>>> -data DOWNLOADED_FILE_NAME -encoding UTF-8
> >> >>>>
> >> >>>> And that should give me a better model file for when I use the name
> >> >>>> finder?
> >> >>>>
> >> >>>> Thanks,
> >> >>>>
> >> >>>> Robert
> >> >>>>
> >> >>>>> From: [email protected]
> >> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
> >> >>>>> Subject: Re: Name finder questions
> >> >>>>> To: [email protected]
> >> >>>>>
> >> >>>>> Hi Robert,
> >> >>>>>
> >> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue
> >> >>>>> <[email protected]> wrote:
> >> >>>>>> Hello,
> >> >>>>>>
> >> >>>>>> I have just started using OpenNLP in a java application. I am
> >> >>>>>> just getting used to the software and have a couple of newbie
> >> >>>>>> questions.
> >> >>>>>>
> >> >>>>>> I see for the name finder there is different model data for people
> >> >>>>>> and organizations (en-ner-organization.bin and en-ner-person.bin).
> >> >>>>>> Is there any way to combine these into one file so I can do 1
> >> >>>>>> search that will give me back person names and organization names?
> >> >>>>>> Or is this not possible and is it best to do two searches?
> >> >>>>> This used to be experimental. It is not anymore, namely, you can
> >> >>>>> train a name finder model for more than one entity type. The models
> >> >>>>> available were trained with rather old newswire data so I would
> >> >>>>> recommend you to train new models using OpenNLP:
> >> >>>>>
> >> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
> >> >>>>>
> >> >>>>> I suppose you do not have manually annotated training data so I would
> >> >>>>> recommend getting the Ontonotes corpus.
> >> >>>>>
> >> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
> >> >>>>>
> >> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
> >> >>>>>
> >> >>>>> Another option is to get a silver standard corpus obtained
> >> >>>>> automatically from the Wikipedia:
> >> >>>>>
> >> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
> >> >>>>>
> >> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are
> >> >>>>> free resources. Search for Ancora, SONAR-1, GermEval 2014 and
> >> >>>>> Evalita 2009.
> >> >>>>>
> >> >>>>>> This question isn't related to the name finder and I don't think it
> >> >>>>>> is possible but thought I would ask anyway. If I had two sentences
> >> >>>>>> say 'Jack climbed the hill. He was very tired.' Is there any way to
> >> >>>>>> know that the pronoun, he, at the start of the second sentence is
> >> >>>>>> actually about Jack the subject of the first sentence? I know in
> >> >>>>>> this simple case it is obvious but I am wondering if there is
> >> >>>>>> anything in the OpenNLP software that will help with this?
> >> >>>>> The example you mentioned is called "pronominal anaphora" and it
> >> >>>>> generalizes to the coreference resolution problem.
> >> >>>>> There used to be a coreference tool in OpenNLP but got moved to
> >> >>>>> the Sandbox because many things need to be updated to be able to
> >> >>>>> distribute it.
> >> >>>>>
> >> >>>>> See http://conll.cemantix.org/2012/introduction.html for more
> >> >>>>> details.
> >> >>>>>
> >> >>>>> HTH,
> >> >>>>>
> >> >>>>> R
> >> >>
> >> >
> >>
> >> --
> >> Dr. Thomas Zastrow
> >> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
> >> Gießenbachstr. 2, D-85748 Garching bei München, Germany
> >> Tel +49-89-3299-1457
> >> http://www.rzg.mpg.de
> >>
> >
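
For anyone preparing a multi-type training file like the one discussed in this thread, here is a rough Python sketch of one way to generate and sanity-check the lines. It assumes the type label after START: is free-form, so that organization and location work the same way as the person example from the documentation; the thread itself does not confirm this. The helper names annotate and summarize, and the sports sentence, are made up for illustration and are not part of OpenNLP.

```python
import re
from collections import Counter

# Matches the opening tag of an annotated name, e.g. <START:person>.
START_TAG = re.compile(r"<START:([^>\s]+)>")


def annotate(tokens, spans):
    """Render one tokenized sentence in the <START:type> ... <END> style.

    spans is a list of (start, end, entity_type) token offsets, with end
    exclusive. Spans are assumed not to overlap (hypothetical helper).
    """
    starts = {start: entity_type for start, _end, entity_type in spans}
    ends = {end for _start, end, _entity_type in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close the span that finished before token i
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # a span that runs to the end of the sentence
    return " ".join(out)


def summarize(lines):
    """Return (sentence_count, Counter of entity mentions per type)."""
    sentences = 0
    types = Counter()
    for line in lines:
        if not line.strip():
            continue  # blank lines are not counted as sentences
        sentences += 1
        types.update(START_TAG.findall(line))
    return sentences, types


if __name__ == "__main__":
    # Invented sports-style sentence, one token per whitespace-separated item.
    tokens = "Jack Smith joined Example United in Glasgow .".split()
    spans = [(0, 2, "person"), (3, 5, "organization"), (6, 7, "location")]
    line = annotate(tokens, spans)
    print(line)
    print(summarize([line]))
```

Running summarize over the finished training file gives a quick way to compare the sentence count against the 15,000 sentence recommendation mentioned above, and to spot entity types that are barely represented before spending time on training.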
