Re: Training a tokenizer that doesn't tokenize automatically on spaces

Joern Kottmann Mon, 13 Feb 2012 05:19:19 -0800

The tokenizer assumes it can always split on whitespace, so this will not
work without modifying the tokenizer code.

You could hack around it by replacing all whitespace with a special
character in your training and test data.
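
A rough sketch of that workaround, in case it helps. WhitespaceMasker and
the placeholder character are made up for illustration and are not part of
OpenNLP; the sketch assumes the placeholder never occurs in the real data
and that each line of training data is masked separately, so the line
breaks between sentences stay intact.

```java
/** Hypothetical helper for the whitespace workaround; not an OpenNLP class. */
final class WhitespaceMasker {

    // Assumption: this character never appears in the real training or test data.
    static final char PLACEHOLDER = '\u2581';

    /**
     * Replaces every whitespace character in a single line with the placeholder,
     * one for one, so character offsets are unchanged. <SPLIT> markers are kept.
     */
    static String mask(String line) {
        StringBuilder sb = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            sb.append(Character.isWhitespace(c) ? PLACEHOLDER : c);
        }
        return sb.toString();
    }

    /**
     * Turns placeholders in a token back into plain spaces.
     * Note: the original kind of whitespace (tab, etc.) is not recovered.
     */
    static String unmask(String token) {
        return token.replace(PLACEHOLDER, ' ');
    }

    public static void main(String[] args) {
        // A training line where only <SPLIT> should mark a token boundary.
        String trainingLine = "New York<SPLIT>, NY";
        String masked = mask(trainingLine);
        System.out.println(masked);
        System.out.println(unmask(masked));
    }
}
```

The idea would be to run mask() over every training line before training,
run it over the input text before tokenizing, and run unmask() over the
tokens that come back. How well a model trained this way performs is
something you would have to evaluate yourself.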

For which language do you need that?

Jörn

On Sat, Feb 11, 2012 at 6:46 PM, Lee Hinman <[email protected]> wrote:

> Hey Guys,
>
> I'm trying to train a tokenizer that ignores spaces and only uses <SPLIT>
> to determine where to split. I wasn't able to find anything in the
> javadocs, is this possible with OpenNLP? If so, could someone point me in
> the right direction regarding it?
>
> - Lee Hinman
