On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset of 
Reuters?) some time ago, so I should have that as well. I would like to know 
the steps to create a training set.

For CONLL03 (which uses Reuters data) we created a converter to produce name finder training material. The process for doing that is described in our docbook (just build
it or download the release candidate).
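
For reference, the name finder training material the converter produces uses OpenNLP's training format: one tokenized sentence per line, with entity spans marked by typed start/end tags. A small (made-up) sample:

```
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
<START:organization> Elsevier N.V. <END> is a Dutch publishing group .
```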

After you have produced the name finder training file, you can use the converter for the tokenizer, together with a detokenizer file, to produce training data for the tokenizer. There is a sample detokenizer file, which may need to be extended a little for English. Maybe we should create a new folder within the tools project to collect all the non-statistical model files. I guess a good detokenizer for the Reuters corpus could be
useful for others too.
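
To illustrate what the detokenizer rules do, here is a minimal sketch of the general idea in Python. This is not OpenNLP's implementation; the rule names only mirror the spirit of OpenNLP's MOVE_LEFT/MOVE_RIGHT detokenization operations, and the token sets are assumed for illustration:

```python
# Minimal rule-based detokenizer sketch (illustration only, not OpenNLP code).
# MOVE_LEFT: attach this token to the preceding token (no space before it).
# MOVE_RIGHT: attach this token to the following token (no space after it).
MOVE_LEFT = {".", ",", "!", "?", ";", ":", ")", "%"}
MOVE_RIGHT = {"("}

def detokenize(tokens):
    """Join tokens into a sentence, suppressing spaces around punctuation."""
    out = []
    for i, tok in enumerate(tokens):
        if i > 0 and tok not in MOVE_LEFT and tokens[i - 1] not in MOVE_RIGHT:
            out.append(" ")
        out.append(tok)
    return "".join(out)

print(detokenize(["Pierre", "Vinken", ",", "61", "years", "old", "."]))
# -> Pierre Vinken, 61 years old.
```

A real detokenizer file encodes the same kind of per-token rules declaratively, which is why a corpus-specific one (e.g. for Reuters) is worth sharing.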

That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train
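
The resulting en-tok.train then contains detokenized sentences in OpenNLP's tokenizer training format, where token boundaries not marked by whitespace carry a <SPLIT> tag, e.g.:

```
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board<SPLIT>.
```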

Hope that helps to get you started. Contributions about this to the documentation are very welcome.

Jörn

Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%[email protected]%3E
