OK, thanks. I guess it was my inexperience that made me think it wasn't named entity data.

So in one of the files I see

Daniel|NNP|I-PER Guerin|NNP|I-PER

So I would need to parse each line, remove the POS tags, and replace the |I-PER labels with <START:person> ... <END> markup around the name, and that would do the job.

Thanks, that helps me a lot.

Robert

> From: [email protected]
> Date: Mon, 25 Apr 2016 16:53:28 +0200
> Subject: Re: Name finder questions
> To: [email protected]
>
> Hi,
>
> It is much easier to try with a corpus that is already available. The
> links I sent are about Named Entities, and they all contain persons,
> locations and organizations. The idea is to obtain (one of) those corpora
> and format it to OpenNLP format to train a new model. If that does not
> work for you (e.g., the output is very bad) then maybe you could
> consider annotating your own data. But that takes time.
>
> HTH,
>
> R
>
> On Mon, Apr 25, 2016 at 4:32 PM, Robert Logue <[email protected]> wrote:
> > I sure did, thanks. I was more unsure if these would work as well for
> > sports specifically or would it be best to make my own?
> >
> > I may have missed something but it is also unclear what the files are
> > for, i.e. is it a model file? The ones I downloaded and looked at seemed
> > to be POS tagging rather than named entity tagging. Maybe my inexperience
> > is making me miss something?
> >
> > Thanks,
> > Robert
> >
> >> From: [email protected]
> >> Date: Mon, 25 Apr 2016 15:43:23 +0200
> >> Subject: Re: Name finder questions
> >> To: [email protected]
> >>
> >> Did you look at the links I sent in a previous email?
> >>
> >> R
> >>
> >> On Mon, Apr 25, 2016 at 3:10 PM, Robert Logue <[email protected]>
> >> wrote:
> >> > The area I would be looking in would be sports and the only things I
> >> > would be interested in would be the 3 things I mentioned, i.e.
> >> >
> >> > People, Organizations and Locations
> >> >
> >> > Do you think there are existing corpora that would cover this? Or would
> >> > there be benefit in creating my own?
> >> >
> >> > Thanks,
> >> > Robert
> >> >
> >> >> From: [email protected]
> >> >> Date: Mon, 25 Apr 2016 09:39:48 +0200
> >> >> Subject: Re: Name finder questions
> >> >> To: [email protected]
> >> >>
> >> >> Hi Robert,
> >> >>
> >> >> Performance varies a lot, and that is still the subject of research.
> >> >> Basically, more data always helps, but depending on the type of data,
> >> >> number of entity types, etc., the quantity required differs. If you
> >> >> need to tag persons, locations and organizations on news or a similar
> >> >> text genre I recommend you use one of the already existing corpora
> >> >> and avoid tagging your own data.
> >> >>
> >> >> Which genre are you interested in?
> >> >>
> >> >> R
> >> >>
> >> >> On Fri, Apr 22, 2016 at 10:31 AM, Robert Logue <[email protected]>
> >> >> wrote:
> >> >> > Very useful, thank you.
> >> >> >
> >> >> > Only question I have left now, for the moment, is on performance. The
> >> >> > minimum recommended number of sentences is 15,000. Does anyone know how
> >> >> > much this would need to be increased to before it would, maybe it
> >> >> > never would, become a performance issue? So if I created training
> >> >> > data with 100,000 sentences would this be an issue? Is there any
> >> >> > number I could go to where it would be an issue?
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Robert
> >> >> >
> >> >> >> Subject: Re: Name finder questions
> >> >> >> To: [email protected]
> >> >> >> From: [email protected]
> >> >> >> Date: Fri, 22 Apr 2016 10:22:50 +0200
> >> >> >>
> >> >> >> Here you can find raw data I used to create a German model, maybe it's
> >> >> >> useful for you:
> >> >> >>
> >> >> >> http://www.thomas-zastrow.de/nlp/
> >> >> >>
> >> >> >> ("Raw trainingdata in OpenNLP format")
> >> >> >>
> >> >> >>
> >> >> >> On 22.04.2016 at 10:17, Robert Logue wrote:
> >> >> >> > Can anyone help here?
> >> >> >> > I don't want to start creating a large
> >> >> >> > training file and find out I have gone about it in the wrong way.
> >> >> >> >
> >> >> >> > The resources I have been looking at are
> >> >> >> >
> >> >> >> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
> >> >> >> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
> >> >> >> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
> >> >> >> >
> >> >> >> > None of which gives the answers I am looking for.
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> >
> >> >> >> > Robert
> >> >> >> >
> >> >> >> >> From: [email protected]
> >> >> >> >> To: [email protected]
> >> >> >> >> Subject: RE: Name finder questions
> >> >> >> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
> >> >> >> >>
> >> >> >> >> I have a few questions regarding creating my own training data
> >> >> >> >> for the name finder. I would like to distinguish between people,
> >> >> >> >> organizations and locations. The example in the documentation
> >> >> >> >> shows the tags to use for people, i.e.
> >> >> >> >>
> >> >> >> >> <START:person> Pierre Vinken <END> , 61 years old , will join the
> >> >> >> >> board as a nonexecutive director Nov. 29 .
> >> >> >> >>
> >> >> >> >> So would I use
> >> >> >> >> <START:organization><END> and <START:location><END> for
> >> >> >> >> organizations and locations respectively? The named entity
> >> >> >> >> guidelines in the documentation, i.e.
> >> >> >> >>
> >> >> >> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
> >> >> >> >>
> >> >> >> >> seem to show different tags getting used, which has confused me
> >> >> >> >> slightly as to which tags I should actually use.
> >> >> >> >>
> >> >> >> >> Also I see the 15,000 line recommendation. Is there any
> >> >> >> >> performance hit if you use many more lines?
> >> >> >> >>
> >> >> >> >> If I create my plain text training file as I outlined above, are
> >> >> >> >> there any other params that are recommended to use beyond the
> >> >> >> >> basic ones, i.e.
> >> >> >> >>
> >> >> >> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en
> >> >> >> >> -data TRAINING_FILE.train -encoding UTF-8
> >> >> >> >>
> >> >> >> >> For instance what is the -params training parameters file used
> >> >> >> >> for? Is this necessary? Should it list the named entities I am
> >> >> >> >> looking for, i.e. person, organization and location? If so, what
> >> >> >> >> format should it be in?
> >> >> >> >>
> >> >> >> >> Sorry for the basic questions here but I can't find the answers in
> >> >> >> >> the documentation or from a quick google.
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >>
> >> >> >> >> Robert
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>> From: [email protected]
> >> >> >> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
> >> >> >> >>> Subject: Re: Name finder questions
> >> >> >> >>> To: [email protected]
> >> >> >> >>>
> >> >> >> >>> Hello,
> >> >> >> >>>
> >> >> >> >>> Yes, that is the idea.
> >> >> >> >>>
> >> >> >> >>> R
> >> >> >> >>>
> >> >> >> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue
> >> >> >> >>> <[email protected]> wrote:
> >> >> >> >>>> I am slightly confused about what I can use the data in those
> >> >> >> >>>> links for. So can I use this data with the training tool like
> >> >> >> >>>> the following
> >> >> >> >>>>
> >> >> >> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en
> >> >> >> >>>> -data DOWNLOADED_FILE_NAME -encoding UTF-8
> >> >> >> >>>>
> >> >> >> >>>> And that should give me a better model file for when I use the
> >> >> >> >>>> name finder?
> >> >> >> >>>>
> >> >> >> >>>> Thanks,
> >> >> >> >>>>
> >> >> >> >>>> Robert
> >> >> >> >>>>
> >> >> >> >>>>> From: [email protected]
> >> >> >> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
> >> >> >> >>>>> Subject: Re: Name finder questions
> >> >> >> >>>>> To: [email protected]
> >> >> >> >>>>>
> >> >> >> >>>>> Hi Robert,
> >> >> >> >>>>>
> >> >> >> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue
> >> >> >> >>>>> <[email protected]> wrote:
> >> >> >> >>>>>> Hello,
> >> >> >> >>>>>>
> >> >> >> >>>>>> I have just started using OpenNLP in a Java application. I
> >> >> >> >>>>>> am just getting used to the software and have a couple
> >> >> >> >>>>>> of newbie questions.
> >> >> >> >>>>>>
> >> >> >> >>>>>> I see for the name finder there is different model data for
> >> >> >> >>>>>> people and organizations (en-ner-organization.bin and
> >> >> >> >>>>>> en-ner-person.bin). Is there any way to combine these into
> >> >> >> >>>>>> one file so I can do 1 search that will give me back person
> >> >> >> >>>>>> names and organization names? Or is this not possible and is
> >> >> >> >>>>>> it best to do two searches?
> >> >> >> >>>>> This used to be experimental. It is not anymore, namely, you can
> >> >> >> >>>>> train a name finder model for more than one entity type. The models
> >> >> >> >>>>> available were trained with rather old newswire data so I would
> >> >> >> >>>>> recommend that you train new models using OpenNLP:
> >> >> >> >>>>>
> >> >> >> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
> >> >> >> >>>>>
> >> >> >> >>>>> I suppose you do not have manually annotated training data so I
> >> >> >> >>>>> could recommend getting the OntoNotes corpus.
> >> >> >> >>>>>
> >> >> >> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
> >> >> >> >>>>>
> >> >> >> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
> >> >> >> >>>>>
> >> >> >> >>>>> Another option is to get a silver standard corpus obtained
> >> >> >> >>>>> automatically from Wikipedia:
> >> >> >> >>>>>
> >> >> >> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
> >> >> >> >>>>>
> >> >> >> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are
> >> >> >> >>>>> free resources. Search for Ancora, SONAR-1, GermEval 2014 and
> >> >> >> >>>>> Evalita 2009.
> >> >> >> >>>>>
> >> >> >> >>>>>> This question isn't related to the name finder and I don't
> >> >> >> >>>>>> think it is possible but thought I would ask anyway. If I had
> >> >> >> >>>>>> two sentences say 'Jack climbed the hill. He was very tired.'
> >> >> >> >>>>>> Is there any way to know that the pronoun, he, at the start
> >> >> >> >>>>>> of the second sentence is actually about Jack, the subject of
> >> >> >> >>>>>> the first sentence? I know in this simple case it is obvious
> >> >> >> >>>>>> but I am wondering if there is anything in the OpenNLP
> >> >> >> >>>>>> software that will help with this?
> >> >> >> >>>>> The example you mentioned is called "pronominal anaphora" and it
> >> >> >> >>>>> generalizes to the coreference resolution problem. There used to
> >> >> >> >>>>> be a coreference tool in OpenNLP but it got moved to the Sandbox
> >> >> >> >>>>> because many things need to be updated to be able to distribute it.
> >> >> >> >>>>>
> >> >> >> >>>>> See http://conll.cemantix.org/2012/introduction.html for more
> >> >> >> >>>>> details.
> >> >> >> >>>>>
> >> >> >> >>>>> HTH,
> >> >> >> >>>>>
> >> >> >> >>>>> R
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >> --
> >> >> >> Dr. Thomas Zastrow
> >> >> >> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
> >> >> >> Gießenbachstr. 2, D-85748 Garching bei München, Germany
> >> >> >> Tel +49-89-3299-1457
> >> >> >> http://www.rzg.mpg.de
> >> >> >>
> >> >> >
> >> >
> >
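
For anyone finding this thread later: the conversion discussed at the top (lines of token|POS|tag items turned into OpenNLP's <START:type> ... <END> markup) could look roughly like the Python sketch below. It is only a sketch under assumptions. The input layout is inferred from the single sample in this thread (Daniel|NNP|I-PER Guerin|NNP|I-PER), with one sentence per line and O for tokens outside any entity. The function name convert_line and the TYPE_NAMES mapping are made up for illustration, and the rest of the sample sentence in the usage line is invented. Labels that are not in the mapping (for example MISC) are treated as ordinary text.

```python
# Hypothetical mapping from corpus labels to the type names used after
# "START:" in the training file; adjust to whatever names you want.
TYPE_NAMES = {"PER": "person", "ORG": "organization", "LOC": "location"}


def convert_line(line):
    """Convert one 'token|POS|TAG ...' sentence into OpenNLP name finder format.

    Assumes every whitespace-separated item has at least two '|' separators.
    """
    out = []
    open_type = None  # entity type of the span currently open, if any
    for item in line.split():
        # rsplit keeps the token text intact even if the token contains '|'
        token, _pos, tag = item.rsplit("|", 2)
        prefix, _, label = tag.partition("-")
        ent_type = TYPE_NAMES.get(label) if prefix in ("B", "I") else None
        # Close the open span when the entity ends, its type changes,
        # or a B- tag signals that a new entity starts right away.
        if open_type is not None and (ent_type != open_type or prefix == "B"):
            out.append("<END>")
            open_type = None
        if ent_type is not None and open_type is None:
            out.append("<START:%s>" % ent_type)
            open_type = ent_type
        out.append(token)
    if open_type is not None:
        out.append("<END>")  # sentence ended inside an entity
    return " ".join(out)


if __name__ == "__main__":
    print(convert_line("Daniel|NNP|I-PER Guerin|NNP|I-PER was|VBD|O born|VBN|O ."))
```

Running each line of the downloaded file through convert_line and writing the results one sentence per line should give a file that can be passed to the TokenNameFinderTrainer command quoted above via -data. Check a few converted sentences by eye first, since the real corpus may contain cases this sketch does not anticipate.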
