We are still missing documentation on how to train on the
conll02 data. It would be nice to receive a patch for it.
Jörn
On 07/04/2012 12:47 PM, Daniel wrote:
Never mind my last mail, I found well-formatted Spanish training files
at this link:
http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html
It's a bit confusing how it's explained here:
http://www.cnts.ua.ac.be/conll2002/ner/
because the Dutch data are well-formatted in this file
http://www.cnts.ua.ac.be/conll2002/ner.tgz but the Spanish data aren't
well-formatted there.
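For anyone hitting the same problem, a quick way to see how a given file is laid out is to count the whitespace-separated fields per line before feeding it to the converter. This is only a minimal sketch: the function name `count_fields` and the sample lines are made up for illustration, and it assumes one token per line with blank lines between sentences.

```python
from collections import Counter


def count_fields(lines):
    """Map 'number of whitespace-separated fields' to how many lines have it.

    Blank lines (sentence separators) are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[len(fields)] += 1
    return counts


# Made-up sample lines, not taken from the real corpus:
# two lines with three fields and one line with only two.
sample = ["Melbourne NP B-LOC", "( Fpa O", "", "Australia B-LOC"]
print(count_fields(sample))

# To check a real file, pass the open file object instead, e.g.
#   with open("esp.train", encoding="latin-1") as f:  # encoding is an assumption
#       print(count_fields(f))
```

If the counts show anything other than three fields per line, the "Expected three fields per line" error from the converter is what you would expect to see.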
2012/7/4, Daniel:
I see, so I can't continue training the existing OpenNLP person-name
model "es-ner-person.bin". I would need the .train file that OpenNLP
used to create this model, and concatenate that file with my new
training files, is that right?
OpenNLP used the conll2002 data to create "es-ner-person.bin", so I
downloaded it from http://www.cnts.ua.ac.be/conll2002/ner.tgz but
I'm not able to use "esp.train", because when I run
C:\>opennlp TokenNameFinderTrainer -lang es -data esp.train -model
es_person.bin
I get this error:
java.lang.IllegalArgumentException: Model not compatible with name finder!
So I guess I must convert this data file to the OpenNLP format, but when I run:
C:\>opennlp TokenNameFinderConverter conll02 -data esp.train -lang es
-types per > corpus_train.txt
and I get this error:
IO error while reading training data or indexing data: Expected three
fields per line in training data!
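To make clear what "OpenNLP format" means here: the name finder trainer expects one tokenized sentence per line, with entities wrapped in `<START:type> ... <END>` markers. Below is a rough sketch of that conversion done by hand. It is not the code behind TokenNameFinderConverter. It assumes one token per line with the token in the first field and a B-/I-/O tag in the last field (so it accepts both two-field and three-field lines), and the function name `conll_to_opennlp` and the `PER` to `person` mapping are my own choices for illustration.

```python
def conll_to_opennlp(lines, wanted=("PER",), names=None):
    """Convert token-per-line B-/I-/O tagged data to OpenNLP name finder lines.

    Only entity types listed in `wanted` are marked up; everything else is
    emitted as plain tokens. Blank lines end a sentence.
    """
    names = names or {"PER": "person"}  # assumed mapping of tag to type name
    sentences, tokens, current = [], [], None
    # The extra "" makes sure the last sentence is flushed.
    for line in list(lines) + [""]:
        fields = line.split()
        if not fields:
            if current is not None:  # close a span left open at sentence end
                tokens.append("<END>")
                current = None
            if tokens:
                sentences.append(" ".join(tokens))
                tokens = []
            continue
        word, tag = fields[0], fields[-1]
        prefix, _, etype = tag.partition("-")
        # Close the open span unless this token continues it.
        if current is not None and not (prefix == "I" and etype == current):
            tokens.append("<END>")
            current = None
        if prefix == "B" and etype in wanted:
            tokens.append("<START:%s>" % names.get(etype, etype.lower()))
            current = etype
        tokens.append(word)
    return sentences


# Made-up example sentence, not taken from the corpus.
sample = ["Juan B-PER", "Perez I-PER", "vive O", "en O", "Madrid B-LOC", ". O"]
for sentence in conll_to_opennlp(sample):
    print(sentence)
```

Taking the tag from the last field rather than a fixed column is a deliberate choice, so the same sketch works whether or not the file carries a middle column.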
2012/7/4, Jörn Kottmann:
On 07/04/2012 08:18 AM, Daniel wrote:
I have an easy question about training NameFinders: can I use 5-6
different training files to train a NameFinderME, or can I only use
one training file to generate one model.bin?
You need to concatenate the files for the cli tools.
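Concatenating can be done with any tool. As a small sketch (the helper names `concatenate` and `concatenate_files` and the UTF-8 default are assumptions, not part of OpenNLP), the only thing to watch is that every file ends with a newline, so the last sentence of one file is not glued onto the first sentence of the next:

```python
from pathlib import Path


def concatenate(texts):
    """Join the contents of several training files into one string.

    A newline is appended to any non-empty part that lacks a trailing one.
    """
    parts = []
    for text in texts:
        if text and not text.endswith("\n"):
            text += "\n"
        parts.append(text)
    return "".join(parts)


def concatenate_files(paths, out_path, encoding="utf-8"):
    """Read each file in `paths` and write the joined result to `out_path`."""
    texts = [Path(p).read_text(encoding=encoding) for p in paths]
    Path(out_path).write_text(concatenate(texts), encoding=encoding)


# In-memory example with made-up content.
print(concatenate(["first file\n", "second file", "third file\n"]), end="")
```

The resulting single file is then what you pass to `-data` on the command line.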
And one last question: if I want my application to detect English
person names and Spanish person names, should I use both
"es-ner-person.bin" and "en-ner-person.bin"? Or are these models 100%
dependent on language, so that if my text is in Spanish I only
have to use "es-ner-person.bin"?
I usually detect the language first with our Document Categorizer,
and then use the model trained for that language.
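The detect-then-dispatch idea can be sketched independently of the OpenNLP API. In the sketch below `detect_language` is a placeholder for whatever classifier you use (for example a Document Categorizer trained on language labels); it is a hypothetical callable that returns a language code, and only the two model file names come from this thread.

```python
# Model file names mentioned in this thread, keyed by language code.
MODELS = {"es": "es-ner-person.bin", "en": "en-ner-person.bin"}


def model_for(text, detect_language, models=MODELS):
    """Return the name finder model file to use for `text`.

    `detect_language` is a hypothetical callable: text -> language code.
    """
    lang = detect_language(text)
    if lang not in models:
        raise ValueError("no name finder model for language %r" % lang)
    return models[lang]


# Stand-in detector for the example; a real one would inspect the text.
print(model_for("Hola, me llamo Daniel", lambda text: "es"))
```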
You can also try to train one name finder for both languages.
Jörn