On 2/23/11 2:18 PM, Rohana Rajapakse wrote:
Hi James,
It works for double quotes, but not for single quotes (i.e. it fails for
'mistakes'). Is it a training issue then (i.e. not having training cases
with words enclosed in single/double quotes)?
I have noticed that your model file is much smaller than the model file
available for download. Is that because your training data set is smaller?
How does that affect tokenization overall?
Are there training sets available to download?
Yes and no: you need a file with tokenization information to train the
tokenizer.
In OpenNLP we use a sentence-per-line format, in which tokens that are not
separated by whitespace are separated by a special tag.
See our documentation for details about the format.
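For example (assuming the current 1.5.x format, where the <SPLIT> tag marks
a boundary between two adjacent tokens that are not separated by whitespace;
check the documentation for the exact tag in your version):

    Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board Nov. 29<SPLIT>.

All other tokens are simply separated by whitespace, as in ordinary text.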
We observed that it is easy to use rules to detokenize correctly
tokenized text.
For that reason I implemented a rule-based detokenizer.
Now you just need some kind of tokenized text to produce a training file
for our tokenizer.
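To make that step concrete, here is a minimal sketch of the idea (not the
OpenNLP implementation; the class name and the rule lists are made up for
illustration). It turns an already tokenized sentence back into one line of
training data by deciding, per token pair, whether whitespace belongs there:

    import java.util.Arrays;
    import java.util.List;

    public class TrainingLineBuilder {

        // Tokens that attach to the token on their left (no space before them).
        private static final List<String> ATTACH_LEFT =
            Arrays.asList(".", ",", ";", ":", "!", "?", ")", "'s");

        // Tokens that attach to the token on their right (no space after them).
        private static final List<String> ATTACH_RIGHT = Arrays.asList("(");

        // Builds one line of tokenizer training data: tokens are joined with
        // a space, and a <SPLIT> tag is inserted where no whitespace belongs.
        public static String toTrainingLine(String[] tokens) {
            if (tokens.length == 0) {
                return "";
            }
            StringBuilder line = new StringBuilder(tokens[0]);
            for (int i = 1; i < tokens.length; i++) {
                if (ATTACH_LEFT.contains(tokens[i])
                        || ATTACH_RIGHT.contains(tokens[i - 1])) {
                    line.append("<SPLIT>");
                } else {
                    line.append(' ');
                }
                line.append(tokens[i]);
            }
            return line.toString();
        }

        public static void main(String[] args) {
            String[] tokens = {"It", "works", ",", "mostly", "."};
            // Prints: It works<SPLIT>, mostly<SPLIT>.
            System.out.println(toTrainingLine(tokens));
        }
    }

Note that quote characters need more state than this sketch has, because a
single or double quote alternates between attaching right (opening) and
attaching left (closing). That is exactly the kind of case your training
data must cover for the tokenizer to learn it.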
You might want to use the Reuters corpus, or other freely available
English-language corpora.
If you have access to the Reuters corpus, I suggest we go through the steps
to train the tokenizer with it.
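Once you have a training file, the training itself is a single command with
the command line tool. The flags below are as in a current 1.5.x build, and
the file names are just placeholders; run "opennlp TokenizerTrainer" without
arguments to see the exact usage of your version:

    $ bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en \
          -data en-token.train -model en-token.bin

The resulting model file can then be loaded with TokenizerModel and used
through TokenizerME as usual.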
Jörn