I already did it and uploaded it here: https://issues.apache.org/jira/browse/OPENNLP-46
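
For anyone who cannot use the patch yet, here is a rough Python sketch of what the conversion has to do. It is not the code attached to the issue above and it is not OpenNLP's own converter. It assumes each input line carries the token in the first column and an IOB2 tag (O, B-PER, I-PER, ...) in the last column, with blank lines between sentences, and it writes the name finder training format: one sentence per line, with names wrapped in <START:person> ... <END>. The function names conll_to_opennlp and convert_files, the PER-to-person mapping and the default encoding are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def conll_to_opennlp(lines: Iterable[str], conll_type: str = "PER",
                     label: str = "person") -> Iterator[str]:
    """Yield one name finder training sentence per CoNLL sentence.

    Assumption: columns are whitespace-separated, token first, IOB2 tag
    last. Only entities of `conll_type` are marked; other tags are ignored.
    """
    tokens: List[str] = []
    inside = False  # True while a span of the wanted type is open

    def flush() -> Iterator[str]:
        # Close any open span and emit the sentence collected so far.
        nonlocal tokens, inside
        if inside:
            tokens.append("<END>")
            inside = False
        if tokens:
            yield " ".join(tokens)
        tokens = []

    for raw in lines:
        fields = raw.split()
        # Blank lines end a sentence; -DOCSTART- marker lines are treated
        # the same way here (an assumption about the input files).
        if not fields or fields[0] == "-DOCSTART-":
            yield from flush()
            continue
        token, tag = fields[0], fields[-1]
        starts = tag == "B-" + conll_type
        continues = inside and tag == "I-" + conll_type
        if inside and not continues:
            tokens.append("<END>")
            inside = False
        if starts:
            tokens.append("<START:" + label + ">")
            inside = True
        tokens.append(token)
    yield from flush()


def convert_files(paths: Iterable[str], out_path: str,
                  encoding: str = "utf-8") -> None:
    """Convert several CoNLL files and concatenate them into one file.

    The encoding default is an assumption; use whatever the data files
    are actually stored in.
    """
    with open(out_path, "w", encoding=encoding) as out:
        for path in paths:
            with open(path, encoding=encoding) as src:
                for sentence in conll_to_opennlp(src):
                    out.write(sentence + "\n")
```

convert_files also covers the concatenation advice quoted below: passing several input paths writes all converted sentences into one training file, which could then be given to TokenNameFinderTrainer through -data.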

2012/7/4, Daniel <[email protected]>:
> I already have it working for both Conll2002 languages,
> Spanish and Dutch, so I think I could write some documentation about
> how to use the converters for CONLL 2002. I'll send the documentation to
> this list when I finish it.
>
> 2012/7/4, Jörn Kottmann <[email protected]>:
>> We are still missing some documentation on how the training on
>> conll02 can be done. Would be nice to receive a patch for it.
>>
>> Jörn
>>
>> On 07/04/2012 12:47 PM, Daniel wrote:
>>> Never mind my last mail, I found well-formatted Spanish training files
>>> at this link:
>>>
>>> http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html
>>>
>>> It's a bit confusing how it's explained here:
>>>
>>> http://www.cnts.ua.ac.be/conll2002/ner/
>>>
>>> because the Dutch data are well-formatted in this file
>>> http://www.cnts.ua.ac.be/conll2002/ner.tgz but the Spanish data aren't
>>> well-formatted there.
>>>
>>> 2012/7/4, Daniel <[email protected]>:
>>>> I see, so I can't train the existing OpenNLP model to detect person
>>>> names, "es-ner-person.bin". I would need the .train file that OpenNLP
>>>> used to create this model, and concatenate that file with my new
>>>> training files, wouldn't I?
>>>>
>>>> OpenNLP used conll2002 data to create "es-ner-person.bin", so I have
>>>> downloaded it from here http://www.cnts.ua.ac.be/conll2002/ner.tgz but
>>>> I'm not able to use "esp.train", because when I run
>>>>
>>>> C:\>opennlp TokenNameFinderTrainer -lang es -data esp.train -model es_person.bin
>>>>
>>>> I get this error:
>>>>
>>>> java.lang.IllegalArgumentException: Model not compatible with name finder!
>>>>
>>>> so I guess that I must convert this data file to OpenNLP format, but when I
>>>> use:
>>>>
>>>> C:\>opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > corpus_train.txt
>>>>
>>>> I get this error:
>>>>
>>>> IO error while reading training data or indexing data: Expected three fields per line in training data!
>>>>
>>>> 2012/7/4, Jörn Kottmann <[email protected]>:
>>>>> On 07/04/2012 08:18 AM, Daniel wrote:
>>>>>> I have an easy question about training NameFinders: can I use 5-6
>>>>>> different training files to train a NameFinderME, or can I only use
>>>>>> one training file to generate one model.bin?
>>>>> You need to concatenate the files for the cli tools.
>>>>>
>>>>>> And one last question: if I want my application to detect English
>>>>>> person names and Spanish person names, should I use
>>>>>> "es-ner-person.bin" and "en-ner-person.bin"? Or are these models 100%
>>>>>> dependent on language, so if my text is in Spanish, I only
>>>>>> have to use "es-ner-person.bin"?
>>>>> I usually detect the language before with our Document Categorizer,
>>>>> and then use the model trained for the language.
>>>>>
>>>>> You can also try to train one name finder for both languages.
>>>>>
>>>>> Jörn
