On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset
of Reuters?) some time ago, so I should have that as well. I would like to know
the steps to create a training set.
For CONLL03 (which uses Reuters data) we created a converter to produce
name finder training material. How to do that is described in our docbook
(just build it or download the release candidate).
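To illustrate the first step: CoNLL03 tags each token with an IOB named-entity label, while the name finder trains on sentences with inline <START:type> ... <END> markup. Below is a minimal sketch of that conversion in Python; the actual converter shipped with OpenNLP also handles -DOCSTART- markers, IOB1 tagging subtleties, and file I/O, and the function name here is made up for illustration.

```python
# Toy sketch: turn one CoNLL03 sentence (token, NE-tag pairs) into the
# <START:type> ... <END> markup the OpenNLP name finder trains on.
# The real converter (see the docbook) does much more; this shows the core idea.

def conll03_to_namefinder(pairs):
    out, open_type = [], None
    for token, tag in pairs:
        # "I-ORG" / "B-ORG" -> "org"; "O" -> no entity
        t = tag.split("-")[-1].lower() if tag != "O" else None
        starts_new = tag.startswith("B-") or (t is not None and t != open_type)
        if open_type and (t is None or starts_new):
            out.append("<END>")           # close the running entity span
            open_type = None
        if t and open_type is None:
            out.append("<START:%s>" % t)  # open a new entity span
            open_type = t
        out.append(token)
    if open_type:
        out.append("<END>")               # close a span ending the sentence
    return " ".join(out)

sentence = [("U.N.", "I-ORG"), ("official", "O"), ("Ekeus", "I-PER"),
            ("heads", "O"), ("for", "O"), ("Baghdad", "I-LOC"), (".", "O")]
print(conll03_to_namefinder(sentence))
# <START:org> U.N. <END> official <START:per> Ekeus <END> heads for <START:loc> Baghdad <END> .
```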
After you have produced the name finder training file, you can use the
converter for the tokenizer together with a detokenizer file to produce
training data for the tokenizer.
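The output of that step is tokenizer training data: plain sentences in which token boundaries not separated by whitespace are marked with <SPLIT>. The sketch below shows the idea with a toy attach-to-the-left punctuation rule standing in for the rule file passed via -detokenizer; the rule set and function name are assumptions for illustration only.

```python
# Toy sketch of what the detokenizer step produces: a tokenizer training
# line where splits without surrounding whitespace are marked with <SPLIT>.
# ATTACH_LEFT is an assumed stand-in for the real detokenizer rule file.

ATTACH_LEFT = {",", ".", "!", "?", ";", ":"}

def to_tokenizer_training(tokens):
    out = [tokens[0]]
    for tok in tokens[1:]:
        if tok in ATTACH_LEFT:
            out.append("<SPLIT>" + tok)   # no space before: mark the split
        else:
            out.append(" " + tok)         # ordinary whitespace boundary
    return "".join(out)

print(to_tokenizer_training(["Pierre", "Vinken", ",", "61", "years", "old", "."]))
# Pierre Vinken<SPLIT>, 61 years old<SPLIT>.
```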
There is a sample detokenizer file, which may need to be extended a little
for English. Maybe we should create a new folder within the tools project to
collect all the non-statistical model files. I guess a good detokenizer for
the Reuters corpus could be useful for others too.
That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train
Hope that helps to get you started; contributions about this to the
documentation are very welcome.
Jörn
Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%[email protected]%3E