Re: Extracting Indices When Tokenizing

Jörn Kottmann Thu, 13 Sep 2012 01:09:30 -0700

Hello,

You need to use OpenNLP via its API. The tokenizer has a tokenizePos method
that returns the spans of the detected tokens; each span holds the start and
end offset of a token in the original text.


Have a look at our documentation:
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.tokenizer.api
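
A minimal sketch of that call, assuming the OpenNLP tools jar is on the
classpath. The class name TokenOffsets and the helper startOffsets are made up
for illustration. SimpleTokenizer is used only because it needs no model file;
it splits on character class, so it will not produce the "did" / "n't" split
from the question. For that you would pass in a TokenizerME loaded with a
trained token model instead.

```java
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.Span;

// Hypothetical helper class, not part of OpenNLP.
class TokenOffsets {

    // Returns the start offset (in the original text) of every detected token.
    static int[] startOffsets(Tokenizer tokenizer, String text) {
        Span[] spans = tokenizer.tokenizePos(text);
        int[] starts = new int[spans.length];
        for (int i = 0; i < spans.length; i++) {
            starts[i] = spans[i].getStart();
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "Mary didn't kiss John";
        // Assumption: a model-free tokenizer, chosen to keep the sketch self-contained.
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
        for (Span span : tokenizer.tokenizePos(text)) {
            // getEnd() is exclusive, so substring(start, end) recovers the token text.
            String token = text.substring(span.getStart(), span.getEnd());
            System.out.println(token + " @ " + span.getStart());
        }
    }
}
```

Because startOffsets takes the Tokenizer interface, the same code works
unchanged with any tokenizer implementation you plug in.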

We do not support this in the command line interface.
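
If you only have the token strings, for example from command line output, one
possible workaround is to align them back to the original text yourself. This
is a sketch and not an OpenNLP feature: the class TokenAligner and the method
align are made up for illustration, and it assumes the tokens appear verbatim
and in order in the text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of OpenNLP.
class TokenAligner {

    // Finds each token in the text, searching forward from the end of the previous one.
    static int[] align(String text, List<String> tokens) {
        int[] starts = new int[tokens.size()];
        int cursor = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            int start = text.indexOf(token, cursor);
            if (start < 0) {
                // The token was altered or reordered, so the assumption does not hold.
                throw new IllegalArgumentException("Token not found: " + token);
            }
            starts[i] = start;
            cursor = start + token.length();
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "Mary didn't kiss John";
        List<String> tokens = Arrays.asList("Mary", "did", "n't", "kiss", "John");
        System.out.println(Arrays.toString(align(text, tokens))); // prints [0, 5, 8, 12, 17]
    }
}
```

Searching from the end of the previous token keeps repeated words from being
matched at an earlier position.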

Hope that helps,
Jörn



On 09/13/2012 04:26 AM, Adam Goodkind wrote:

Hi,

When tokenizing a string of text, is there also a way to track the index (of 
the original text) where the token begins?

For example:
"Mary didn't kiss John"
[(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)]

If there is a way to extract the 0, 5, 8, 12 and 17 from somewhere, that would 
be great. I cannot rely on whitespace, since the tokenizer sometimes breaks up 
words.

Thanks,
Adam
