My NameFinder training data (created from CoNLL + Reuters) has <START> and 
<END> markup for person names, but no <SPLIT> markup. I am trying the 
testTokenizer() test with TokenizerTestUtil.createMaxentTokenModel() to 
create a model from my training data file. I had to remove the <START> and 
<END> tags and add a few <SPLIT> tags to get the test to work (to get "Number 
of Outcomes" to match). It learns a model now, but the model is not perfect. 
I still need to add <SPLIT> markup for all single and double quotes, etc.
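
A minimal sketch of that manual conversion, in Python rather than Java for 
brevity. It assumes one whitespace-tokenized sentence per line with <START> 
(or <START:type>) and <END> tags. The function name and the punctuation sets 
are hypothetical choices for illustration; they are not taken from any 
OpenNLP detokenizer file, and this is not the TokenizerConverter itself.

```python
import re

# Matches name finder tags such as <START>, <START:person> and <END>.
NAME_TAG = re.compile(r"<START(?::[^>]+)?>|<END>")

# Assumed attachment rules (illustrative only).
ATTACH_LEFT = {".", ",", ";", ":", "!", "?", ")", "]", "}"}
ATTACH_RIGHT = {"(", "[", "{"}
QUOTES = {'"', "'"}


def to_tokenizer_line(name_finder_line):
    """Strip name tags and mark whitespace-free token boundaries with <SPLIT>."""
    tokens = NAME_TAG.sub(" ", name_finder_line).split()
    pieces = []
    open_quotes = set()
    glue_next = False  # True when the previous token attaches to this one
    for token in tokens:
        if token in QUOTES:
            # An opening quote attaches to the right, a closing one to the left.
            closing = token in open_quotes
            if closing:
                open_quotes.discard(token)
            else:
                open_quotes.add(token)
            attach_left, attach_right = closing, not closing
        else:
            attach_left = token in ATTACH_LEFT
            attach_right = token in ATTACH_RIGHT
        if pieces:
            pieces.append("<SPLIT>" if (glue_next or attach_left) else " ")
        pieces.append(token)
        glue_next = attach_right
    return "".join(pieces)


example = '<START> Pierre Vinken <END> , 61 years old , said " hello " .'
converted = to_tokenizer_line(example)
print(converted)
```

Quotes are paired naively within a line, so a lone apostrophe token would be 
treated as an opening quote; real data would need a more careful rule.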

By the way, where is the "TokenizerConverter" that you mentioned? My 
download (from SourceForge) doesn't have it. Also, where is the converter 
that you created to produce name finder training data from CoNLL03? Am I 
missing some code in my download?

Also, please point me to your "docbook". I would like to know more about the 
detokenizer. I can't find a "release candidate" on the download site.

Thanks


Rohana Rajapakse
Senior Software Developer
GOSS Interactive

t:  +44 (0)844 880 3637
f:  +44 (0)844 880 3638
e: [email protected]
w: www.gossinteractive.com

-----Original Message-----
From: Rohana Rajapakse [mailto:[email protected]]
Sent: 24 February 2011 11:10
To: [email protected]
Subject: RE: Tokenizer issue - Quotation marks

Thanks a lot.

I did create a name finder training data file from CoNLL some time ago. I 
will have a look at how I did it. I may be able to convert this training file 
into a tokenizer training file.

Thanks a lot. I will let you know how I get on with this, and I would be 
happy to contribute anything that might be useful to others.

Rohana


-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: 24 February 2011 10:59
To: [email protected]
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/11 2:33 PM, Rohana Rajapakse wrote:
> Yes, I do have the Reuters corpus with me. I also used the Brown corpus 
> (subset of Reuters?) some time ago, so I should have that as well. I would 
> like to know the steps to create a training set.
>
For CONLL03 (which uses Reuters data) we created a converter to produce name
finder training material. The process is described in our docbook (just
build it or download the release candidate).

After you have produced the name finder training file, you can use the
converter for the tokenizer, together with a detokenizer file, to produce
training data for the tokenizer. There is a sample detokenizer file, which
may need to be extended a little for English. Maybe we should create a new
folder within the tools project to collect all the non-statistical model
files. I guess a good detokenizer for the Reuters corpus could be useful for
others too.

That can be done with a command like this:
bin/opennlp TokenizerConverter namefinder -encoding UTF-8 -data
en-ner-reuters.train -detokenizer latin-detokenizer.xml > en-tok.train
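
Before training on the converted file, a quick sanity check can help. The 
sketch below (Python, with a hypothetical helper name) counts the <SPLIT> 
tags and reports lines that still carry name finder tags. It assumes one 
sentence per line; the sample lines are made up for illustration.

```python
import re

# Matches leftover name finder tags such as <START>, <START:person>, <END>.
NAME_TAG = re.compile(r"<START(?::[^>]+)?>|<END>")


def check_tokenizer_lines(lines):
    """Return (number of <SPLIT> tags, 1-based numbers of lines with name tags)."""
    split_count = 0
    tagged_lines = []
    for number, line in enumerate(lines, start=1):
        split_count += line.count("<SPLIT>")
        if NAME_TAG.search(line):
            tagged_lines.append(number)
    return split_count, tagged_lines


sample = [
    "Pierre Vinken<SPLIT>, 61 years old<SPLIT>.",
    "<START> Mr . Smith <END> left .",
]
split_count, tagged_lines = check_tokenizer_lines(sample)
print(split_count, tagged_lines)
```

Any iterable of lines works, so an open file handle for en-tok.train can be 
passed in place of the sample list. A count of zero <SPLIT> tags suggests the 
data contains no split examples for the tokenizer to learn from.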

Hope that helps to get you started. Contributions to the documentation
about this are very welcome.

Jörn

Mail about the RC:
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201102.mbox/%[email protected]%3E

