Thanks a lot. I created a name finder training data file from CoNLL some time ago and will have a look at how I did it. I may be able to convert that training file into a tokenizer training file.
Thanks a lot. Will let you know how I get on with this, and I would be happy to contribute anything that might be useful to others.

Rohana

-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: 24 February 2011 10:59
To: [email protected]
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes, I do have the Reuters corpus with me. I also used the Brown corpus (subset of
> Reuters?) some time ago, so I should have that as well. I would like to know
> the steps to create a training set.
>

For CONLL03 (which uses Reuters data) we created a converter to produce name finder training material. The process is described in our docbook (just build it or download the release candidate).

After you have produced the name finder training file, you can use the converter for the tokenizer together with a detokenizer file to produce training data for the tokenizer. There is a sample detokenizer file which may need to be extended a little for English. Maybe we should create a new folder within the tools project to collect all the non-statistical model files. I guess a good detokenizer for the Reuters corpus could be useful for others too.

That can be done with a command like this:

    bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train

Hope that helps to get you started. Contributions about this to the documentation are very welcome.

Jörn

Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%[email protected]%3E
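For anyone following the thread, here is a rough Python sketch of what the conversion discussed above amounts to: name tags are stripped from a whitespace-tokenized name finder line, and a <SPLIT> marker is written wherever detokenization would join two tokens without a space. The rule sets (MOVE_LEFT, MOVE_RIGHT, MATCHING) and the function name to_tokenizer_line are made up for illustration. They are not the contents of any real detokenizer file, and this is not the OpenNLP implementation.

```python
import re

# Hypothetical detokenizer rules, invented for this sketch only.
MOVE_LEFT = {".", ",", ";", ":", "!", "?", ")"}   # attach to the previous token
MOVE_RIGHT = {"("}                                # attach to the next token
MATCHING = {'"'}                                  # alternate: opening, then closing

# Matches name finder markup such as <START:person>, <START> and <END>.
NAME_TAG = re.compile(r"<START(:[^>]*)?>|<END>")


def to_tokenizer_line(ner_line):
    """Turn one name finder training line into a tokenizer training line."""
    # Drop the name tags; what remains is the whitespace-separated token list.
    tokens = [t for t in ner_line.split() if not NAME_TAG.fullmatch(t)]

    open_quotes = set()
    out = []
    attach_next = False  # True if the previous token attaches to the next one
    for tok in tokens:
        if tok in MATCHING:
            # A quote seen for the first time opens; the next one closes.
            if tok in open_quotes:
                op = "left"
                open_quotes.discard(tok)
            else:
                op = "right"
                open_quotes.add(tok)
        elif tok in MOVE_LEFT:
            op = "left"
        elif tok in MOVE_RIGHT:
            op = "right"
        else:
            op = None

        if out:
            # No space in the detokenized text means a <SPLIT> boundary.
            out.append("<SPLIT>" if (attach_next or op == "left") else " ")
        out.append(tok)
        attach_next = op == "right"
    return "".join(out)


if __name__ == "__main__":
    line = '<START:person> Pierre Vinken <END> , 61 years old , said " hello " .'
    print(to_tokenizer_line(line))
```

The alternating treatment of the double quote is the part most relevant to the quotation mark issue in the subject line: an opening quote is joined to the following token and a closing quote to the preceding one, so both end up next to a <SPLIT> marker in the training data.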
