On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
Yes, I do have the Reuters corpus with me. I also used the Brown corpus (a subset
of Reuters?) some time ago, so I should have that as well. I would like to know
the steps to create a training set.
For CONLL03 (which uses Reuters data) we created a converter to produce
name finder training material. How to do that is described in our docbook
(just build it or download the release candidate).
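To illustrate the first step: CoNLL03 tags each token with an IOB named-entity label, while the name finder trains on sentences with inline <START:type> ... <END> markup. Below is a minimal sketch of that conversion in Python; the actual converter shipped with OpenNLP also handles -DOCSTART- markers, IOB1 tagging subtleties, and file I/O, and the function name here is made up for illustration.

```python
# Toy sketch: turn one CoNLL03 sentence (token, NE-tag pairs) into the
# <START:type> ... <END> markup the OpenNLP name finder trains on.
# The real converter (see the docbook) does much more; this shows the core idea.

def conll03_to_namefinder(pairs):
    out, open_type = [], None
    for token, tag in pairs:
        # "I-ORG" / "B-ORG" -> "org"; "O" -> no entity
        t = tag.split("-")[-1].lower() if tag != "O" else None
        starts_new = tag.startswith("B-") or (t is not None and t != open_type)
        if open_type and (t is None or starts_new):
            out.append("<END>")           # close the running entity span
            open_type = None
        if t and open_type is None:
            out.append("<START:%s>" % t)  # open a new entity span
            open_type = t
        out.append(token)
    if open_type:
        out.append("<END>")               # close a span ending the sentence
    return " ".join(out)

sentence = [("U.N.", "I-ORG"), ("official", "O"), ("Ekeus", "I-PER"),
            ("heads", "O"), ("for", "O"), ("Baghdad", "I-LOC"), (".", "O")]
print(conll03_to_namefinder(sentence))
# <START:org> U.N. <END> official <START:per> Ekeus <END> heads for <START:loc> Baghdad <END> .
```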
After you have produced the name finder training file, you can use the
converter for the tokenizer together with a detokenizer file to produce
training data for the tokenizer.
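The output of that step is tokenizer training data: plain sentences in which token boundaries not separated by whitespace are marked with <SPLIT>. The sketch below shows the idea with a toy attach-to-the-left punctuation rule standing in for the rule file passed via -detokenizer; the rule set and function name are assumptions for illustration only.

```python
# Toy sketch of what the detokenizer step produces: a tokenizer training
# line where splits without surrounding whitespace are marked with <SPLIT>.
# ATTACH_LEFT is an assumed stand-in for the real detokenizer rule file.

ATTACH_LEFT = {",", ".", "!", "?", ";", ":"}

def to_tokenizer_training(tokens):
    out = [tokens[0]]
    for tok in tokens[1:]:
        if tok in ATTACH_LEFT:
            out.append("<SPLIT>" + tok)   # no space before: mark the split
        else:
            out.append(" " + tok)         # ordinary whitespace boundary
    return "".join(out)

print(to_tokenizer_training(["Pierre", "Vinken", ",", "61", "years", "old", "."]))
# Pierre Vinken<SPLIT>, 61 years old<SPLIT>.
```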
There is a sample detokenizer file, which may need to be extended a little
for English. Maybe we should create a new folder within the tools project to
collect all the non-statistical model files. I guess a good detokenizer for
the Reuters corpus could be useful for others too.
That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train
Hope that helps to get you started; contributions about this to the
documentation are very welcome.
Jörn
Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%[email protected]%3E