Did you look at the links I sent in a previous email?

R

On Mon, Apr 25, 2016 at 3:10 PM, Robert Logue <[email protected]> wrote:
> The area I would be looking at is sports, and the only things I would be
> interested in are the three entity types I mentioned, i.e.
>
> People, Organizations and Location
>
> Do you think there are existing corpora that would cover this? Or would there
> be a benefit in creating my own?
>
> Thanks,
> Robert
>
>> From: [email protected]
>> Date: Mon, 25 Apr 2016 09:39:48 +0200
>> Subject: Re: Name finder questions
>> To: [email protected]
>>
>> Hi Robert,
>>
>> Performance varies a lot, and that is still the subject of research.
>> Basically, more data always helps, but depending on the type of data,
>> number of entity types, etc., the quantity required differs. If you
>> need to tag persons, locations and organizations in news or a similar
>> text genre, I recommend that you use one of the existing corpora
>> rather than tagging your own data.
>>
>> Which genre are you interested in?
>>
>> R
>>
>> On Fri, Apr 22, 2016 at 10:31 AM, Robert Logue <[email protected]> wrote:
>> > Very useful, thank you.
>> >
>> > The only question I have left, for the moment, is about performance. The
>> > minimum recommended number of sentences is 15,000. Does anyone know how
>> > far this could be increased before it became a performance issue, if it
>> > ever would? For example, if I created training data with 100,000
>> > sentences, would that be an issue? Is there any number at which it would
>> > become one?
>> >
>> > Thanks,
>> >
>> > Robert
>> >
>> >> Subject: Re: Name finder questions
>> >> To: [email protected]
>> >> From: [email protected]
>> >> Date: Fri, 22 Apr 2016 10:22:50 +0200
>> >>
>> >> Here you can find the raw data I used to create a German model; maybe
>> >> it's useful for you:
>> >>
>> >> http://www.thomas-zastrow.de/nlp/
>> >>
>> >> ("Raw trainingdata in OpenNLP format")
>> >>
>> >>
>> >> On 22.04.2016 at 10:17, Robert Logue wrote:
>> >> > Can anyone help here? I don't want to start creating a large training 
>> >> > file and find out I have gone about it in the wrong way.
>> >> >
>> >> > The resources I have been looking at are
>> >> >
>> >> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
>> >> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
>> >> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
>> >> >
>> >> > None of which gives the answers I am looking for.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Robert
>> >> >
>> >> >> From: [email protected]
>> >> >> To: [email protected]
>> >> >> Subject: RE: Name finder questions
>> >> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
>> >> >>
>> >> >> I have a few questions regarding creating my own training data for the 
>> >> >> name finder. I would like to distinguish between people, organizations 
>> >> >> and locations. The example in the documentation shows the tags to use
>> >> >> for people, i.e.
>> >> >>
>> >> >> <START:person> Pierre Vinken <END> , 61 years old , will join the
>> >> >> board as a nonexecutive director Nov. 29 .
>> >> >>
>> >> >> So would I use <START:organization><END> and <START:location><END>
>> >> >> for organizations and locations respectively? The named entity
>> >> >> guidelines in the documentation, i.e.
>> >> >>
>> >> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
>> >> >>
>> >> >> seem to show different tags being used, which has confused me
>> >> >> slightly as to which tags I should actually use.
>> >> >>
>> >> >> Also, I see the 15,000 line recommendation. Is there any performance
>> >> >> hit if you use many more lines?
>> >> >>
>> >> >> If I create my plain text training file as I outlined above, are
>> >> >> there any other params that are recommended beyond the basic ones,
>> >> >> i.e.
>> >> >>
>> >> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data 
>> >> >> TRAINING_FILE.train -encoding UTF-8
>> >> >>
>> >> >> For instance, what is the -params training parameters file used for?
>> >> >> Is it necessary? Should it list the named entities I am looking for,
>> >> >> i.e. person, organization and location? If so, what format should it
>> >> >> be in?
>> >> >>
>> >> >> Sorry for the basic questions here, but I can't find the answers in
>> >> >> the documentation or from a quick Google search.
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Robert
>> >> >>
>> >> >>
>> >> >>> From: [email protected]
>> >> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
>> >> >>> Subject: Re: Name finder questions
>> >> >>> To: [email protected]
>> >> >>>
>> >> >>> Hello,
>> >> >>>
>> >> >>> Yes, that is the idea.
>> >> >>>
>> >> >>> R
>> >> >>>
>> >> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue <[email protected]> 
>> >> >>> wrote:
>> >> >>>> I am slightly confused about what I can use the data in those links
>> >> >>>> for. Can I use this data with the training tool, like the following?
>> >> >>>>
>> >> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en
>> >> >>>> -data DOWNLOADED_FILE_NAME -encoding UTF-8
>> >> >>>>
>> >> >>>> And would that give me a better model file for when I use the name
>> >> >>>> finder?
>> >> >>>>
>> >> >>>> Thanks,
>> >> >>>>
>> >> >>>> Robert
>> >> >>>>
>> >> >>>>> From: [email protected]
>> >> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
>> >> >>>>> Subject: Re: Name finder questions
>> >> >>>>> To: [email protected]
>> >> >>>>>
>> >> >>>>> Hi Robert,
>> >> >>>>>
>> >> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue 
>> >> >>>>> <[email protected]> wrote:
>> >> >>>>>> Hello,
>> >> >>>>>>
>> >> >>>>>> I have just started using OpenNLP in a Java application. I am
>> >> >>>>>> just getting used to the software and have a couple of newbie
>> >> >>>>>> questions.
>> >> >>>>>>
>> >> >>>>>> I see that for the name finder there are different models for people
>> >> >>>>>> and organizations (en-ner-organization.bin and en-ner-person.bin).
>> >> >>>>>> Is there any way to combine these into one file so I can do one
>> >> >>>>>> search that will give me back person names and organization names?
>> >> >>>>>> Or is this not possible, and is it best to do two searches?
>> >> >>>>> This used to be experimental. It no longer is: you can train a
>> >> >>>>> name finder model for more than one entity type. The available
>> >> >>>>> models were trained on rather old newswire data, so I would
>> >> >>>>> recommend that you train new models using OpenNLP:
>> >> >>>>>
>> >> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
>> >> >>>>>
>> >> >>>>> I suppose you do not have manually annotated training data, so I
>> >> >>>>> would recommend getting the OntoNotes corpus:
>> >> >>>>>
>> >> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
>> >> >>>>>
>> >> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
>> >> >>>>>
>> >> >>>>> Another option is to get a silver standard corpus obtained
>> >> >>>>> automatically from Wikipedia:
>> >> >>>>>
>> >> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
>> >> >>>>>
>> >> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are
>> >> >>>>> free resources. Search for Ancora, SONAR-1, GermEval 2014 and
>> >> >>>>> Evalita 2009.
>> >> >>>>>
>> >> >>>>>> This question isn't related to the name finder and I don't think 
>> >> >>>>>> it is possible but thought I would ask anyway. If I had two 
>> >> >>>>>> sentences say 'Jack climbed the hill. He was very tired.' Is there 
>> >> >>>>>> any way to know that the pronoun, he, at the start of the second 
>> >> >>>>>> sentence is actually about Jack the subject of the first sentence? 
>> >> >>>>>> I know in this simple case it is obvious but I am wondering if 
>> >> >>>>>> there is anything in the OpenNLP software that will help with this?
>> >> >>>>> The example you mentioned is called "pronominal anaphora", and it
>> >> >>>>> generalizes to the coreference resolution problem. There used to
>> >> >>>>> be a coreference tool in OpenNLP, but it was moved to the Sandbox
>> >> >>>>> because many things need to be updated before it can be
>> >> >>>>> distributed.
>> >> >>>>>
>> >> >>>>> See http://conll.cemantix.org/2012/introduction.html for more 
>> >> >>>>> details.
>> >> >>>>>
>> >> >>>>> HTH,
>> >> >>>>>
>> >> >>>>> R
>> >> >>
>> >> >
>> >>
>> >> --
>> >> Dr. Thomas Zastrow
>> >> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
>> >> Gießenbachstr. 2, D-85748 Garching bei München, Germany
>> >> Tel +49-89-3299-1457
>> >> http://www.rzg.mpg.de
>> >>
>> >
>

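For anyone following this thread who is preparing their own training file with the <START:type> ... <END> annotations discussed above, it may help to sanity-check the file before running the trainer. Below is a minimal sketch in Python. It is not part of OpenNLP, and the function names count_entities and count_file_entities are made up for illustration. It assumes one sentence per line, tags written as separate whitespace-delimited tokens (as in the Pierre Vinken example), and no nested spans.

```python
import re
from collections import Counter

# Matches an opening tag such as <START:person>; the type name is captured.
# Assumption: tags are standalone tokens separated by whitespace.
START_TAG = re.compile(r"^<START:([^>\s]+)>$")
END_TAG = "<END>"


def count_entities(sentence):
    """Count annotated entities per type in one whitespace-tokenized sentence.

    Raises ValueError on nested, unclosed, empty, or stray tags.
    """
    counts = Counter()
    open_type = None   # entity type of the span currently open, if any
    span_tokens = 0    # number of tokens seen inside the open span
    for token in sentence.split():
        match = START_TAG.match(token)
        if match:
            if open_type is not None:
                raise ValueError("nested <START> tag")
            open_type = match.group(1)
            span_tokens = 0
        elif token == END_TAG:
            if open_type is None:
                raise ValueError("<END> without <START>")
            if span_tokens == 0:
                raise ValueError("empty entity span")
            counts[open_type] += 1
            open_type = None
        elif open_type is not None:
            span_tokens += 1
    if open_type is not None:
        raise ValueError("unclosed <START> tag")
    return dict(counts)


def count_file_entities(lines):
    """Sum entity counts over an iterable of sentences (one per line)."""
    total = Counter()
    for line in lines:
        total.update(count_entities(line))
    return dict(total)
```

To check a whole file, something like count_file_entities(open("TRAINING_FILE.train", encoding="utf-8")) should work, where TRAINING_FILE.train is the placeholder name used earlier in the thread. The per-type totals give a rough idea of whether people, organizations and locations are all reasonably represented, and a ValueError points at a malformed line.
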