Never mind my last mail, I found well-formatted Spanish training files at this link:
http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html

It's a bit confusing how it's explained here: http://www.cnts.ua.ac.be/conll2002/ner/
because the Dutch data are well-formatted in this file
http://www.cnts.ua.ac.be/conll2002/ner.tgz but the Spanish data aren't
well-formatted there.

2012/7/4, Daniel <[email protected]>:
> I see, so I can't train the existing OpenNLP model for detecting person
> names, "es-ner-person.bin"... I would need the .train file that OpenNLP
> used to create this model, and concatenate that file with my new
> training files, isn't it?
>
> OpenNLP used the CoNLL 2002 data to create "es-ner-person.bin", so I have
> downloaded it from here http://www.cnts.ua.ac.be/conll2002/ner.tgz but
> I'm not able to use "esp.train", because when I run
>
> C:\>opennlp TokenNameFinderTrainer -lang es -data esp.train -model
> es_person.bin
>
> I get this error:
>
> java.lang.IllegalArgumentException: Model not compatible with name finder!
>
> so I guess I must convert this data file to the OpenNLP format, but when I use:
>
> C:\>opennlp TokenNameFinderConverter conll02 -data esp.train -lang es
> -types per > corpus_train.txt
>
> I get this error:
>
> IO error while reading training data or indexing data: Expected three
> fields per line in training data!
>
> 2012/7/4, Jörn Kottmann <[email protected]>:
>> On 07/04/2012 08:18 AM, Daniel wrote:
>>> I have an easy question about training NameFinders: can I use 5-6
>>> different training files to train a NameFinderME, or can I only use
>>> one training file to generate one model.bin?
>>
>> You need to concatenate the files for the cli tools.
>>
>>> And one last question: if I want my application to detect English
>>> person names and Spanish person names, should I use both
>>> "es-ner-person.bin" and "en-ner-person.bin"? Or are these models 100%
>>> dependent on the language, so if my text is in Spanish, I only
>>> have to use "es-ner-person.bin"?
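[Editor's note: Jörn's "concatenate the files" advice for the CLI trainer can be sketched as a small shell session. The file names (person1.train, person2.train, all-person.train) and the sample sentences are hypothetical, and the lines are assumed to already be in the OpenNLP name-finder training format (names marked with <START:person> ... <END>), not raw CoNLL columns.]

```shell
# Hypothetical training files in OpenNLP name-finder format
# (one sentence per line, person names marked inline).
printf '%s\n' 'Ayer vi a <START:person> Juan Garcia <END> en Madrid .' > person1.train
printf '%s\n' '<START:person> Maria Lopez <END> trabaja en Barcelona .' > person2.train

# The CLI trainer takes a single -data file, so concatenate first:
cat person1.train person2.train > all-person.train

# Then train one model from the combined file (only if opennlp is on PATH):
if command -v opennlp >/dev/null 2>&1; then
    opennlp TokenNameFinderTrainer -lang es -data all-person.train -model es_person.bin
fi
```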
>>
>> I usually detect the language before with our Document Categorizer,
>> and then use the model trained for the language.
>>
>> You can also try to train one name finder for both languages.
>>
>> Jörn
>>
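[Editor's note: Jörn's suggestion (detect the language first, then use the model trained for that language) can be sketched as a tiny shell dispatch. DETECTED_LANG stands in for whatever a language-identification step, such as a Document Categorizer model trained on language labels, would report; it is hardcoded here purely for illustration.]

```shell
# DETECTED_LANG is a placeholder for the output of a language
# identification step; hardcoded here as an assumption.
DETECTED_LANG="es"

# Pick the person-name model that matches the detected language.
case "$DETECTED_LANG" in
    es) NER_MODEL="es-ner-person.bin" ;;
    en) NER_MODEL="en-ner-person.bin" ;;
    *)  echo "No person-name model for language: $DETECTED_LANG" >&2; exit 1 ;;
esac

echo "Using model: $NER_MODEL"
```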
