Re: Training NameFinders and person names detection

Daniel Wed, 04 Jul 2012 05:03:47 -0700

I already did it and uploaded it here:

https://issues.apache.org/jira/browse/OPENNLP-46


2012/7/4, Daniel <[email protected]>:
> I have already got it working for both CoNLL 2002 languages, Spanish
> and Dutch, so I think I could write some documentation on how to use
> the converters for CoNLL 2002. I'll send the documentation to this
> list when I finish it.
>
>
>
>
> 2012/7/4, Jörn Kottmann <[email protected]>:
>> We are still missing some documentation on how the training on
>> conll02 can be done. Would be nice to receive a patch for it.
>>
>> Jörn
>>
>> On 07/04/2012 12:47 PM, Daniel wrote:
>>> Never mind my last mail, I found well-formatted Spanish training files
>>> at this link:
>>>
>>> http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html
>>>
>>> It's a bit confusing how it's explained here:
>>>
>>> http://www.cnts.ua.ac.be/conll2002/ner/
>>>
>>> because the Dutch data is well-formatted in this file
>>> http://www.cnts.ua.ac.be/conll2002/ner.tgz but the Spanish data isn't
>>> well-formatted there.
>>>
>>>
>>>
>>> 2012/7/4, Daniel <[email protected]>:
>>>> I see, so I can't train the existing OpenNLP model for detecting
>>>> person names, "es-ner-person.bin". I would need the .train file that
>>>> OpenNLP used to create this model, and concatenate that file with my
>>>> new training files, wouldn't I?
>>>>
>>>> OpenNLP used CoNLL 2002 data to create "es-ner-person.bin", so I have
>>>> downloaded it from here http://www.cnts.ua.ac.be/conll2002/ner.tgz but
>>>> I'm not able to use "esp.train", because when I run
>>>>
>>>> C:\>opennlp TokenNameFinderTrainer -lang es -data esp.train -model
>>>> es_person.bin
>>>>
>>>> I get this error:
>>>>
>>>> java.lang.IllegalArgumentException: Model not compatible with name
>>>> finder!
>>>>
>>>>
>>>> so I guess that I must convert this data file to OpenNLP format, but
>>>> when I run:
>>>>
>>>> C:\>opennlp TokenNameFinderConverter conll02 -data esp.train -lang es
>>>> -types per > corpus_train.txt
>>>>
>>>> and I get this error:
>>>>
>>>> IO error while reading training data or indexing data: Expected three
>>>> fields per line in training data!
>>>>
>>>>
>>>> 2012/7/4, Jörn Kottmann <[email protected]>:
>>>>> On 07/04/2012 08:18 AM, Daniel wrote:
>>>>>> I have an easy question about training NameFinders: can I use 5-6
>>>>>> different training files to train a NameFinderME, or can I only use
>>>>>> one training file to generate one model.bin?
>>>>> You need to concatenate the files for the cli tools.
>>>>>
>>>>>> And one last question: if I want my application to detect English
>>>>>> person names and Spanish person names, should I use both
>>>>>> "es-ner-person.bin" and "en-ner-person.bin"? Or are these models 100%
>>>>>> dependent on language, so that if my text is in Spanish, I only
>>>>>> have to use "es-ner-person.bin"?
>>>>> I usually detect the language first with our Document Categorizer,
>>>>> and then use the model trained for that language.
>>>>>
>>>>> You can also try to train one name finder for both languages.
>>>>>
>>>>> Jörn
>>>>>
>>>>>
>>
>>
>>
>
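
Note on the conversion step discussed above: the sketch below is only an
illustration of what converting CoNLL 2002 data to the OpenNLP name finder
training format involves. It is not the OpenNLP conll02 converter. It assumes
one token per line with the token in the first column and an IOB2 tag
(B-PER, I-PER, O, ...) in the last column, blank lines between sentences, and
that PER should be written with the label "person". The function name
conll_to_opennlp is hypothetical.

```python
def conll_to_opennlp(lines, entity="PER", label="person"):
    """Turn CoNLL-style token lines into OpenNLP name finder sentences.

    Assumption: the token is the first whitespace-separated field and the
    IOB2 tag is the last one; a blank line ends a sentence.
    """
    sentences = []
    tokens = []
    inside = False  # True while we are inside an entity of the wanted type

    def flush():
        # Close an entity left open at the end of a sentence, then emit it.
        nonlocal inside
        if inside:
            tokens.append("<END>")
            inside = False
        if tokens:
            sentences.append(" ".join(tokens))
            tokens.clear()

    for raw in lines:
        fields = raw.split()
        if not fields:
            flush()
            continue
        if fields[0] == "-DOCSTART-":
            # Document marker lines carry no token text; skip them.
            continue
        word, tag = fields[0], fields[-1]
        starts = tag == "B-" + entity
        continues = inside and tag == "I-" + entity
        if inside and not continues:
            tokens.append("<END>")
            inside = False
        if starts:
            tokens.append("<START:" + label + ">")
            inside = True
        tokens.append(word)

    flush()
    return sentences
```

For example, the lines "Juan B-PER", "Carlos I-PER", "vive O" produce the
single sentence "<START:person> Juan Carlos <END> vive". Entities of other
types (B-LOC and so on) are written as plain tokens, which mirrors the intent
of "-types per" in the command quoted above.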

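Note on concatenating training files, as suggested in the thread: a minimal
sketch, assuming the files are already in the OpenNLP name finder format (one
sentence per line) and share one encoding. It separates the files with an
empty line, on the assumption that an empty line is read as a document
boundary. The function name and the file names in the usage note are
hypothetical.

```python
from pathlib import Path


def concatenate_training_files(paths, output_path, encoding="utf-8"):
    """Join several training files into one, separated by an empty line.

    Returns the number of non-empty input files that were written.
    """
    chunks = []
    for path in paths:
        # Drop leading/trailing blank lines so the separator stays uniform.
        text = Path(path).read_text(encoding=encoding).strip("\n")
        if text:
            chunks.append(text)
    Path(output_path).write_text("\n\n".join(chunks) + "\n", encoding=encoding)
    return len(chunks)
```

Calling concatenate_training_files(["a.train", "b.train"], "all.train") would
write a single file that can then be passed to the trainer with
"-data all.train" in place of "esp.train" in the command quoted above.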