Thanks a lot.

I created a name finder training data file from the CoNLL data some time ago.
I will have a look at how I did it. I may be able to convert this training
file to produce a tokenizer training file.

I will let you know how I get on with this, and will contribute anything that
might be useful to others.

Rohana


-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: 24 February 2011 10:59
To: [email protected]
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes, I do have the Reuters corpus. I also used the Brown corpus (a subset of
> Reuters?) some time ago, so I should have that as well. I would like to know
> the steps to create a training set.
>
For CONLL03 (which uses Reuters data) we created a converter to produce name
finder training material. The process is described in our docbook (just build
it or download the release candidate).

After you have produced the name finder training file, you can use the
tokenizer converter together with a detokenizer file to produce training data
for the tokenizer. There is a sample detokenizer file, which may need to be
extended a little for English. Maybe we should create a new folder within the
tools project to collect all the non-statistical model files. I guess a good
detokenizer for the Reuters corpus could be useful for others too.

That can be done with a command like this:

bin/opennlp TokenizerConverter namefinder -encoding UTF-8 \
    -data en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train
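
To illustrate roughly what such a conversion does, here is a minimal Python
sketch. It is not the OpenNLP implementation: the rule sets and the function
name to_tokenizer_line are hypothetical, and the rules are not the contents of
the sample detokenizer file. It assumes name finder lines with
whitespace-separated tokens and <START:type> ... <END> tags, and it writes
<SPLIT> wherever two tokens would be adjacent in the original text without
whitespace between them.

```python
import re

# Hypothetical rules, standing in for a detokenizer dictionary.
MERGE_TO_LEFT = {".", ",", ";", ":", "!", "?", ")"}   # attach to previous token
MERGE_TO_RIGHT = {"("}                                # attach to next token
MATCHING = {'"'}   # alternates: opening quote attaches right, closing left

SPLIT = "<SPLIT>"
# Matches name finder markup such as <START:person>, <START> and <END>.
TAG = re.compile(r"^<START(:[^>]*)?>$|^<END>$")


def to_tokenizer_line(namefinder_line):
    """Turn one name finder training line into one tokenizer training line."""
    tokens = [t for t in namefinder_line.split() if not TAG.match(t)]
    out = []
    open_quote = False   # True while inside a quoted span
    glue_next = False    # True if the previous token attaches to the right
    for tok in tokens:
        if tok in MATCHING:
            merge_left = open_quote
            merge_right = not open_quote
            open_quote = not open_quote
        else:
            merge_left = tok in MERGE_TO_LEFT
            merge_right = tok in MERGE_TO_RIGHT
        if out:
            # No whitespace in the original text: mark the token boundary.
            out.append(SPLIT if (merge_left or glue_next) else " ")
        out.append(tok)
        glue_next = merge_right
    return "".join(out)


if __name__ == "__main__":
    line = '<START:person> Pierre Vinken <END> , 61 years old , said " hello " .'
    print(to_tokenizer_line(line))
    # Pierre Vinken<SPLIT>, 61 years old<SPLIT>, said "<SPLIT>hello<SPLIT>"<SPLIT>.
```

The quotation mark handling here simply alternates between opening and closing
quotes within a line, so an unbalanced quote would throw off the rest of that
line.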

I hope that helps you get started. Contributions to the documentation about
this are very welcome.

Jörn

Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%[email protected]%3E

