I already have it working for both CoNLL 2002 languages, Spanish and Dutch, so I think I could write some documentation on how to use the converters for CoNLL 2002. I'll send the documentation to this list when I finish it.
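To illustrate what the conversion step in this thread has to achieve, here is a minimal Python sketch that turns CoNLL-style lines (token first, B-/I-/O tag last, blank line between sentences) into one-sentence-per-line text with `<START:type> ... <END>` markers. It is not the OpenNLP converter and makes no claim about how that converter works internally; the function name `conll_to_namefinder` and the `TYPE_NAMES` mapping are assumptions made up for this example.

```python
# Hypothetical mapping from CoNLL-2002 tag suffixes to the type names
# written into the <START:...> markers; adjust it to your own setup.
TYPE_NAMES = {"PER": "person", "ORG": "organization",
              "LOC": "location", "MISC": "misc"}


def conll_to_namefinder(lines, wanted=("PER",)):
    """Yield one marked-up sentence per CoNLL sentence.

    Assumes whitespace-separated columns with the token in the first
    column and the B-/I-/O tag in the last one, so it accepts both
    two-column and three-column input.
    """
    sentence = []      # output tokens of the current sentence
    open_kind = None   # entity type of the span that is currently open

    def close_span():
        nonlocal open_kind
        if open_kind is not None:
            sentence.append("<END>")
            open_kind = None

    for raw in lines:
        fields = raw.split()
        # A blank line (or a document marker) ends the current sentence.
        if not fields or fields[0] == "-DOCSTART-":
            close_span()
            if sentence:
                yield " ".join(sentence)
                sentence.clear()
            continue
        token, tag = fields[0], fields[-1]
        prefix, _, kind = tag.partition("-")
        # Anything other than a continuation of the open span closes it.
        if not (prefix == "I" and kind == open_kind):
            close_span()
            if prefix in ("B", "I") and kind in wanted:
                sentence.append("<START:%s>" % TYPE_NAMES.get(kind, kind.lower()))
                open_kind = kind
        sentence.append(token)

    # Flush the last sentence if the input did not end with a blank line.
    close_span()
    if sentence:
        yield " ".join(sentence)


if __name__ == "__main__":
    sample = ["Juan B-PER", "Carlos I-PER", "vive O", "en O",
              "Madrid B-LOC", ". O", ""]
    for line in conll_to_namefinder(sample):
        print(line)
```

Reading only the first and last column keeps the sketch independent of whether a part-of-speech column is present; entity types not listed in `wanted` are written out as plain tokens.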
2012/7/4, Jörn Kottmann <[email protected]>:
> We are still missing some documentation on how the training on
> conll02 can be done. Would be nice to receive a patch for it.
>
> Jörn
>
> On 07/04/2012 12:47 PM, Daniel wrote:
>> Never mind my last mail, I found well-formatted Spanish training files
>> at this link:
>>
>> http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html
>>
>> It's a bit confusing how it's explained here:
>>
>> http://www.cnts.ua.ac.be/conll2002/ner/
>>
>> because the Dutch data are well-formatted in this file
>> http://www.cnts.ua.ac.be/conll2002/ner.tgz but the Spanish data aren't
>> well-formatted there.
>>
>> 2012/7/4, Daniel <[email protected]>:
>>> I see, so I can't train the existing OpenNLP model for detecting person
>>> names, "es-ner-person.bin". I would need the .train file that OpenNLP
>>> used to create this model, and concatenate that file with my new
>>> training files, is that right?
>>>
>>> OpenNLP used conll2002 data to create "es-ner-person.bin", so I have
>>> downloaded it from here http://www.cnts.ua.ac.be/conll2002/ner.tgz but
>>> I'm not able to use "esp.train", because when I run
>>>
>>> C:\>opennlp TokenNameFinderTrainer -lang es -data esp.train -model es_person.bin
>>>
>>> I get this error:
>>>
>>> java.lang.IllegalArgumentException: Model not compatible with name finder!
>>>
>>> so I guess that I must convert this data file to OpenNLP format, but when
>>> I use:
>>>
>>> C:\>opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > corpus_train.txt
>>>
>>> I get this error:
>>>
>>> IO error while reading training data or indexing data: Expected three fields per line in training data!
>>>
>>> 2012/7/4, Jörn Kottmann <[email protected]>:
>>>> On 07/04/2012 08:18 AM, Daniel wrote:
>>>>> I have an easy question about training NameFinders: can I use 5-6
>>>>> different training files to train a NameFinderME, or can I only use
>>>>> one training file to generate one model.bin?
>>>> You need to concatenate the files for the cli tools.
>>>>
>>>>> And one last question: if I want my application to detect English
>>>>> person names and Spanish person names, should I use
>>>>> "es-ner-person.bin" and "en-ner-person.bin"? Or are these models 100%
>>>>> dependent on language, so if my text is in Spanish, I only
>>>>> have to use "es-ner-person.bin"?
>>>> I usually detect the language first with our Document Categorizer,
>>>> and then use the model trained for that language.
>>>>
>>>> You can also try to train one name finder for both languages.
>>>>
>>>> Jörn
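For the advice above about concatenating several training files into one before running the CLI tools, a minimal Python sketch might look like the following. The function name `concatenate_training_files` and the file names in the usage comment are hypothetical; the files are copied as raw bytes, which assumes they all share the same encoding.

```python
from pathlib import Path


def concatenate_training_files(sources, target):
    """Write every source file into target, one after another.

    Files are copied as raw bytes, so nothing is re-encoded. A newline is
    added after any file that does not end with one, so that the last
    line of one file never runs into the first line of the next.
    """
    with open(target, "wb") as out:
        for source in sources:
            data = Path(source).read_bytes()
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")


# Example with made-up file names:
# concatenate_training_files(["part1.train", "part2.train"], "all.train")
```

The resulting single file can then be passed to the trainer through its `-data` argument, as in the commands quoted in this thread.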
