Re: Name finder questions

Rodrigo Agerri Mon, 25 Apr 2016 07:55:09 -0700

Hi,

It is much easier to try with a corpus that is already available. The
links I sent are about Named Entities, and they all contain persons,
locations and organizations. The idea is to obtain (one of) those corpora
and convert it to the OpenNLP format to train a new model. If that does not
work for you (e.g., the output is very bad) then maybe you could
consider annotating your own data. But that takes time.
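
As a rough illustration of the conversion step, here is a minimal sketch in
plain Java (no OpenNLP dependency). It assumes the corpus is in a CoNLL-style
layout: one token per line, the BIO tag (B-PER, I-PER, O, ...) in the last
whitespace-separated column, and a blank line between sentences. The class
name and the label mapping (PER/ORG/LOC to person/organization/location) are
my own assumptions, so adjust them to whatever the corpus you download
actually uses. Entity types not in the mapping are left untagged.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: turns CoNLL-style BIO lines into OpenNLP name finder training lines.
final class ConllToOpenNlp {

    // Assumed mapping from corpus labels to the names used in <START:type>.
    private static final Map<String, String> TYPES =
            Map.of("PER", "person", "ORG", "organization", "LOC", "location");

    /** Converts one sentence; each line is "token ... tag" (first and last column). */
    static String convertSentence(List<String> lines) {
        List<String> parts = new ArrayList<>();
        String current = null; // type of the span that is currently open, or null
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            String token = cols[0];
            String tag = cols[cols.length - 1];
            // "B-PER" / "I-PER" -> "person"; "O" and unmapped types -> null
            String type = (tag.startsWith("B-") || tag.startsWith("I-"))
                    ? TYPES.get(tag.substring(2)) : null;
            // A new span starts on B-, or on I- when no span of that type is open.
            boolean begins = type != null
                    && (tag.startsWith("B-") || !type.equals(current));
            if (current != null && (type == null || begins)) {
                parts.add("<END>");
                current = null;
            }
            if (type != null && current == null) {
                parts.add("<START:" + type + ">");
                current = type;
            }
            parts.add(token);
        }
        if (current != null) {
            parts.add("<END>"); // close a span that runs to the end of the sentence
        }
        return String.join(" ", parts);
    }

    /** Splits the input on blank lines and returns one training line per sentence. */
    static List<String> convert(List<String> lines) {
        List<String> sentences = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (!buffer.isEmpty()) {
                    sentences.add(convertSentence(buffer));
                    buffer.clear();
                }
            } else {
                buffer.add(line);
            }
        }
        if (!buffer.isEmpty()) {
            sentences.add(convertSentence(buffer));
        }
        return sentences;
    }

    // Reads the corpus from stdin and writes the training file to stdout, both UTF-8.
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            List<String> lines = in.lines().collect(Collectors.toList());
            PrintStream out = new PrintStream(System.out, true, "UTF-8");
            convert(lines).forEach(out::println);
        }
    }
}
```

With Java 11 or later this could be run as
"java ConllToOpenNlp.java < corpus.conll > TRAINING_FILE.train" (file names
are placeholders), and the result passed to the TokenNameFinderTrainer
command quoted further down in this thread.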


HTH,

R

On Mon, Apr 25, 2016 at 4:32 PM, Robert Logue <[email protected]> wrote:
> I sure did, thanks. I was more unsure if these would work as well for sports 
> specifically or would it be best to make my own?
>
> I may have missed something, but it is also unclear what the files are for, 
> i.e. whether they are model files. The ones I downloaded and looked at seemed 
> to be for POS tagging rather than named entity tagging. Maybe my inexperience 
> is making me miss something?
>
> Thanks,
> Robert
>
>
>
>> From: [email protected]
>> Date: Mon, 25 Apr 2016 15:43:23 +0200
>> Subject: Re: Name finder questions
>> To: [email protected]
>>
>> Did you look at the links I sent in a previous email?
>>
>> R
>>
>> On Mon, Apr 25, 2016 at 3:10 PM, Robert Logue <[email protected]> wrote:
>> > The area I would be looking in would be sports and the only things I would 
>> > be interested in would be the 3 things I mentioned ie
>> >
>> > People, Organizations and Location
>> >
>> > Do you think there are existing corpora that would cover this? Or would 
>> > there be benefit in creating my own?
>> >
>> > Thanks,
>> > Robert
>> >
>> >> From: [email protected]
>> >> Date: Mon, 25 Apr 2016 09:39:48 +0200
>> >> Subject: Re: Name finder questions
>> >> To: [email protected]
>> >>
>> >> Hi Robert,
>> >>
>> >> Performance varies a lot, and that is still the subject of research.
>> >> Basically, more data always helps, but depending on the type of data,
>> >> number of entity types, etc., the quantity required differs. If you
>> >> need to tag persons, locations and organizations in news or a similar
>> >> text genre, I recommend that you use one of the already existing corpora
>> >> and avoid tagging your own data.
>> >>
>> >> Which genre are you interested in?
>> >>
>> >> R
>> >>
>> >> On Fri, Apr 22, 2016 at 10:31 AM, Robert Logue <[email protected]> 
>> >> wrote:
>> >> > Very useful, thank you.
>> >> >
>> >> > The only question I have left, for the moment, is on performance. The 
>> >> > minimum recommended number of sentences is 15,000. Does anyone know how 
>> >> > far this could be increased before it would become a performance issue, 
>> >> > if it ever would? So if I created training data with 
>> >> > 100,000 sentences would this be an issue? Is there any number I could 
>> >> > go to where it would be an issue?
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Robert
>> >> >
>> >> >> Subject: Re: Name finder questions
>> >> >> To: [email protected]
>> >> >> From: [email protected]
>> >> >> Date: Fri, 22 Apr 2016 10:22:50 +0200
>> >> >>
>> >> >> Here you can find raw data I used to create a German model, maybe it's
>> >> >> useful for you:
>> >> >>
>> >> >> http://www.thomas-zastrow.de/nlp/
>> >> >>
>> >> >> ("Raw trainingdata in OpenNLP format")
>> >> >>
>> >> >>
>> >> >> Am 22.04.2016 um 10:17 schrieb Robert Logue:
>> >> >> > Can anyone help here? I don't want to start creating a large 
>> >> >> > training file and find out I have gone about it in the wrong way.
>> >> >> >
>> >> >> > The resources I have been looking at are
>> >> >> >
>> >> >> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
>> >> >> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
>> >> >> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
>> >> >> >
>> >> >> > None of which gives the answers I am looking for.
>> >> >> >
>> >> >> > Thanks,
>> >> >> >
>> >> >> > Robert
>> >> >> >
>> >> >> >> From: [email protected]
>> >> >> >> To: [email protected]
>> >> >> >> Subject: RE: Name finder questions
>> >> >> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
>> >> >> >>
>> >> >> >> I have a few questions regarding creating my own training data for 
>> >> >> >> the name finder. I would like to distinguish between people, 
>> >> >> >> organizations and locations. The example in the documentation shows 
>> >> >> >> the tags to use for people ie
>> >> >> >>
>> >> >> >> <START:person> Pierre Vinken <END> , 61 years old , will join the 
>> >> >> >> board as a nonexecutive director Nov. 29 .
>> >> >> >>
>> >> >> >> So would I use <START:organization><END> and <START:location><END> for 
>> >> >> >> organizations and locations respectively? The named entity 
>> >> >> >> guidelines in the documentation, i.e.
>> >> >> >>
>> >> >> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
>> >> >> >>
>> >> >> >> seem to show different tags getting used which has confused me 
>> >> >> >> slightly as to which tags I should actually use?
>> >> >> >>
>> >> >> >> Also, I see the 15,000 line recommendation. Is there any performance 
>> >> >> >> hit if you use many more lines?
>> >> >> >>
>> >> >> >> If I create my plain text training file as I outlined above, are 
>> >> >> >> there any other params that are recommended beyond the basic ones, 
>> >> >> >> i.e.
>> >> >> >>
>> >> >> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en 
>> >> >> >> -data TRAINING_FILE.train -encoding UTF-8
>> >> >> >>
>> >> >> >> For instance, what is the -params training parameters file used for? 
>> >> >> >> Is it necessary? Should it list the named entities I am looking 
>> >> >> >> for, i.e. person, organization and location? If so, what format 
>> >> >> >> should it be in?
>> >> >> >>
>> >> >> >> Sorry for the basic questions here, but I can't find the answers in 
>> >> >> >> the documentation or from a quick Google search.
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >>
>> >> >> >> Robert
>> >> >> >>
>> >> >> >>
>> >> >> >>> From: [email protected]
>> >> >> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
>> >> >> >>> Subject: Re: Name finder questions
>> >> >> >>> To: [email protected]
>> >> >> >>>
>> >> >> >>> Hello,
>> >> >> >>>
>> >> >> >>> Yes, that is the idea.
>> >> >> >>>
>> >> >> >>> R
>> >> >> >>>
>> >> >> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue 
>> >> >> >>> <[email protected]> wrote:
>> >> >> >>>> I am slightly confused what I can use the data in those links 
>> >> >> >>>> for? So can I use this data with the training tool like the 
>> >> >> >>>> following
>> >> >> >>>>
>> >> >> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en
>> >> >> >>>> -data DOWNLOADED_FILE_NAME -encoding UTF-8
>> >> >> >>>> And that should give me a better model file for when I use the 
>> >> >> >>>> name finder?
>> >> >> >>>>
>> >> >> >>>> Thanks,
>> >> >> >>>>
>> >> >> >>>> Robert
>> >> >> >>>>
>> >> >> >>>>> From: [email protected]
>> >> >> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
>> >> >> >>>>> Subject: Re: Name finder questions
>> >> >> >>>>> To: [email protected]
>> >> >> >>>>>
>> >> >> >>>>> Hi Robert,
>> >> >> >>>>>
>> >> >> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue 
>> >> >> >>>>> <[email protected]> wrote:
>> >> >> >>>>>> Hello,
>> >> >> >>>>>>
>> >> >> >>>>>> I have just started using OpenNLP in a Java application. I am 
>> >> >> >>>>>> just getting used to the software and have a couple of 
>> >> >> >>>>>> newbie questions.
>> >> >> >>>>>>
>> >> >> >>>>>> I see for the name finder there is different model data for 
>> >> >> >>>>>> people and organizations (en-ner-organization.bin and 
>> >> >> >>>>>> en-ner-person.bin). Is there any way to combine these into one 
>> >> >> >>>>>> file so I can do 1 search that will give me back person names 
>> >> >> >>>>>> and organization names. Or is this not possible and is it best 
>> >> >> >>>>>> to do two searches?
>> >> >> >>>>> This used to be experimental. It is not anymore: you can train
>> >> >> >>>>> a name finder model for more than one entity type. The models
>> >> >> >>>>> available were trained with rather old newswire data, so I would
>> >> >> >>>>> recommend that you train new models using OpenNLP:
>> >> >> >>>>>
>> >> >> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
>> >> >> >>>>>
>> >> >> >>>>> I suppose you do not have manually annotated training data, so I
>> >> >> >>>>> would recommend getting the OntoNotes corpus.
>> >> >> >>>>>
>> >> >> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
>> >> >> >>>>>
>> >> >> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
>> >> >> >>>>>
>> >> >> >>>>> Another option is to get a silver standard corpus obtained
>> >> >> >>>>> automatically from the Wikipedia:
>> >> >> >>>>>
>> >> >> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
>> >> >> >>>>>
>> >> >> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are
>> >> >> >>>>> free resources. Search for Ancora, SONAR-1, GermEval 2014 and
>> >> >> >>>>> Evalita 2009.
>> >> >> >>>>>
>> >> >> >>>>>> This question isn't related to the name finder and I don't 
>> >> >> >>>>>> think it is possible but thought I would ask anyway. If I had 
>> >> >> >>>>>> two sentences say 'Jack climbed the hill. He was very tired.' 
>> >> >> >>>>>> Is there any way to know that the pronoun, he, at the start of 
>> >> >> >>>>>> the second sentence is actually about Jack the subject of the 
>> >> >> >>>>>> first sentence? I know in this simple case it is obvious but I 
>> >> >> >>>>>> am wondering if there is anything in the OpenNLP software that 
>> >> >> >>>>>> will help with this?
>> >> >> >>>>> The example you mentioned is called "pronominal anaphora" and it
>> >> >> >>>>> generalizes to the coreference resolution problem. There used to
>> >> >> >>>>> be a coreference tool in OpenNLP, but it was moved to the Sandbox
>> >> >> >>>>> because many things need to be updated before it can be
>> >> >> >>>>> distributed.
>> >> >> >>>>>
>> >> >> >>>>> See http://conll.cemantix.org/2012/introduction.html for more 
>> >> >> >>>>> details.
>> >> >> >>>>>
>> >> >> >>>>> HTH,
>> >> >> >>>>>
>> >> >> >>>>> R
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >> --
>> >> >> Dr. Thomas Zastrow
>> >> >> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
>> >> >> Gießenbachstr. 2, D-85748 Garching bei München, Germany
>> >> >> Tel +49-89-3299-1457
>> >> >> http://www.rzg.mpg.de
>> >> >>
>> >> >
>> >
>
