I have already got it working for both CoNLL 2002 languages, Spanish
and Dutch, so I think I could write some documentation on how to use
the converters for CoNLL 2002. I'll send the documentation to this
list when I finish it.
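
As a rough illustration of what the conll02 conversion amounts to, here
is a minimal sketch in Python. It is not the OpenNLP converter itself.
It assumes the token is the first whitespace-separated column, the IOB
tag (B-PER, I-PER, O, ...) is the last one, and sentences are separated
by blank lines. The function name and the TYPE_NAMES labels are
hypothetical, chosen for illustration only. The output uses the
<START:type> ... <END> markup that the name finder trainer reads.

```python
# Hypothetical mapping from CoNLL-2002 tag suffixes to type labels.
TYPE_NAMES = {"PER": "person", "ORG": "organization",
              "LOC": "location", "MISC": "misc"}


def conll02_to_opennlp(lines, wanted=("PER",)):
    """Turn CoNLL-2002 style lines into one marked-up sentence per string.

    Assumption: first column is the token, last column is the IOB tag.
    Only entity types listed in `wanted` are marked; others are kept as
    plain tokens.
    """
    sentences = []
    tokens = []
    open_type = None  # entity type of the span currently open, if any

    def close_span():
        nonlocal open_type
        if open_type is not None:
            tokens.append("<END>")
            open_type = None

    def flush():
        # End of sentence: close any open span and store the sentence.
        close_span()
        if tokens:
            sentences.append(" ".join(tokens))
            tokens.clear()

    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            flush()
            continue
        fields = line.split()
        word, tag = fields[0], fields[-1]
        prefix, _, etype = tag.partition("-")
        # A new B- tag, an O tag, or an I- tag of another type ends the span.
        if prefix != "I" or open_type != etype:
            close_span()
        if prefix in ("B", "I") and etype in wanted and open_type is None:
            tokens.append("<START:%s>" % TYPE_NAMES.get(etype, etype.lower()))
            open_type = etype
        tokens.append(word)
    flush()
    return sentences


if __name__ == "__main__":
    sample = ["Juan B-PER", "Perez I-PER", "vive O", "en O",
              "Madrid B-LOC", ". O", ""]
    for sentence in conll02_to_opennlp(sample):
        print(sentence)
```

Taking the first and last columns is a deliberate simplification so the
same sketch tolerates files with or without a middle column; a file
that fails the real converter may still need its columns checked by
hand.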




2012/7/4, Jörn Kottmann <[email protected]>:
> We are still missing some documentation on how the training on
> conll02 can be done. Would be nice to receive a patch for it.
>
> Jörn
>
> On 07/04/2012 12:47 PM, Daniel wrote:
>> Never mind my last mail, I found well-formatted Spanish training
>> files at this link:
>>
>> http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html
>>
>> It's a bit confusing how it's explained here:
>>
>> http://www.cnts.ua.ac.be/conll2002/ner/
>>
>> because the Dutch data are well-formatted in this file
>> http://www.cnts.ua.ac.be/conll2002/ner.tgz but the Spanish data
>> aren't well-formatted there.
>>
>>
>>
>> 2012/7/4, Daniel <[email protected]>:
>>> I see, so I can't retrain the existing OpenNLP model for detecting
>>> person names, "es-ner-person.bin". I would need the .train file that
>>> OpenNLP used to create this model, and concatenate that file with my
>>> new training files, right?
>>>
>>> OpenNLP used CoNLL 2002 data to create "es-ner-person.bin", so I have
>>> downloaded it from here http://www.cnts.ua.ac.be/conll2002/ner.tgz but
>>> I'm not able to use "esp.train", because when I run
>>>
>>> C:\>opennlp TokenNameFinderTrainer -lang es -data esp.train -model
>>> es_person.bin
>>>
>>> I get this error:
>>>
>>> java.lang.IllegalArgumentException: Model not compatible with name
>>> finder!
>>>
>>>
>>> so I guess that I must convert this data file to the OpenNLP format,
>>> but I run:
>>>
>>> C:\>opennlp TokenNameFinderConverter conll02 -data esp.train -lang es
>>> -types per > corpus_train.txt
>>>
>>> and I get this error:
>>>
>>> IO error while reading training data or indexing data: Expected three
>>> fields per line in training data!
>>>
>>>
>>> 2012/7/4, Jörn Kottmann <[email protected]>:
>>>> On 07/04/2012 08:18 AM, Daniel wrote:
>>>>> I have an easy question about training NameFinders: can I use 5-6
>>>>> different training files to train a NameFinderME, or can I only
>>>>> use one training file to generate one model.bin?
>>>> You need to concatenate the files for the cli tools.
>>>>
>>>>> And one last question: if I want my application to detect English
>>>>> person names and Spanish person names, should I use both
>>>>> "es-ner-person.bin" and "en-ner-person.bin"? Or are these models
>>>>> 100% dependent on language, so that if my text is in Spanish I
>>>>> only have to use "es-ner-person.bin"?
>>>> I usually detect the language first with our Document Categorizer,
>>>> and then use the model trained for that language.
>>>>
>>>> You can also try to train one name finder for both languages.
>>>>
>>>> Jörn
>>>>
>>>>
>
>
>
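
The concatenation step mentioned in the thread (one combined file for
the CLI tools) could be sketched as follows. This is only a sketch in
Python; the function name, the file names, and the UTF-8 default are
assumptions, not something the thread specifies. It assumes every input
file is already in the same training format and encoding.

```python
from pathlib import Path


def concatenate_training_files(paths, out_path, encoding="utf-8"):
    """Write the contents of `paths`, in order, into a single file.

    A newline is added after any file that does not end with one, so the
    last sentence of one file never runs into the first of the next.
    """
    with open(out_path, "w", encoding=encoding) as out:
        for path in paths:
            text = Path(path).read_text(encoding=encoding)
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")


# Example call with hypothetical file names:
# concatenate_training_files(["part1.train", "part2.train"], "all.train")
```

The combined file can then be passed to the trainer through its -data
argument, as in the commands quoted above.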
