Yes, I do have the Reuters corpus. I also used the Brown corpus (a subset of 
Reuters?) some time ago, so I should still have that as well. I would like to 
know the steps to create a training set.

Thanks

Rohana
-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: 23 February 2011 13:28
To: [email protected]
Subject: Re: Tokenizer issue - Quotation marks

On 2/23/11 2:18 PM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. it fails for
> 'mistakes'). Is it a training issue then (the training data not having
> cases with words enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download?

Yes and no, you need a file with tokenization information to train the
tokenizer.
In OpenNLP we use a one-sentence-per-line format, and tokens that are not
separated by whitespace are separated by a special tag.
See our documentation for information about the format.
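
To illustrate the format: the OpenNLP manual describes the special tag as
<SPLIT>. The sketch below is only an illustration in Python, not part of
OpenNLP; the function names and the sample sentence are made up for this
example. It reads one training line back into its tokens and into the raw
text the tokenizer would see.

```python
SPLIT_TAG = "<SPLIT>"  # tag named in the OpenNLP manual for the tokenizer training format


def tokens_from_training_line(line):
    """Return the tokens encoded in one training line.

    Tokens are separated either by whitespace or by SPLIT_TAG.
    """
    return line.replace(SPLIT_TAG, " ").split()


def raw_text_from_training_line(line):
    """Return the untokenized sentence by dropping every SPLIT_TAG."""
    return line.replace(SPLIT_TAG, "")


# Hypothetical sample line: the single quotes and the final period are
# separate tokens even though no whitespace surrounds them in the raw text.
sample = "He said '<SPLIT>mistakes<SPLIT>' happen<SPLIT>."
print(tokens_from_training_line(sample))
print(raw_text_from_training_line(sample))
```

A training file then has one such line per sentence.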

We observed that it is easy to use rules to detokenize correctly
tokenized text.
For that reason I implemented a rule based detokenizer.

Now you just need some kind of tokenized text to produce a training file
for our tokenizer.
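
As a rough sketch of that idea, the Python below turns an already tokenized
sentence into one training line using a few attachment rules. It is not the
OpenNLP detokenizer: the rule sets, the function name and the <SPLIT> handling
are assumptions made for illustration, and it will mishandle apostrophes
inside words (for example a "n't" token) and nested quotes.

```python
SPLIT_TAG = "<SPLIT>"  # tag named in the OpenNLP manual for the tokenizer training format

# Hypothetical, deliberately small rule sets for English punctuation.
ATTACH_LEFT = {".", ",", "!", "?", ";", ":", ")"}   # glued to the previous token
ATTACH_RIGHT = {"("}                                 # glued to the next token
QUOTES = {'"', "'"}                                  # alternate: opening, then closing


def training_line_from_tokens(tokens):
    """Join tokens into one training line, writing SPLIT_TAG where the
    rules say two tokens touch in the original text and a space otherwise."""
    parts = []
    open_quotes = set()
    glue_to_next = False  # True when the previous token attaches to this one
    for index, token in enumerate(tokens):
        glue_to_prev = token in ATTACH_LEFT
        glue_after = token in ATTACH_RIGHT
        if token in QUOTES:
            if token in open_quotes:
                # Closing quote: attaches to the token before it.
                open_quotes.remove(token)
                glue_to_prev = True
            else:
                # Opening quote: attaches to the token after it.
                open_quotes.add(token)
                glue_after = True
        if index > 0:
            parts.append(SPLIT_TAG if (glue_to_prev or glue_to_next) else " ")
        parts.append(token)
        glue_to_next = glue_after
    return "".join(parts)


print(training_line_from_tokens(["He", "said", "'", "mistakes", "'", "happen", "."]))
```

Running this over every sentence of a tokenized corpus and writing one line
per sentence would give a file in the format described above.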
You might want to use the Reuters corpus, or other freely available
English language corpora.
If you have access to the Reuters corpus, I suggest that we go through
the steps to train the tokenizer with it.

Jörn




