Re: Tokenizer issue - Quotation marks

James Kosin Tue, 22 Feb 2011 19:51:55 -0800

On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
> I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
> made "mistakes".  It gives me "mistakes as a token (note the starting quote
> is part of the token). But if I change the word mistakes to Mistakes
> (i.e. capital M) in the input text, then I get the token Mistakes
> (correctly).
>
> Is anyone aware of this issue, and any idea how to get around it?
>
From the comment on the download site, it looks like the model was
trained with OpenNLP data.  This usually means it doesn't contain very
many samples, so a case like an opening quote before a lowercase word
may simply not be covered.  I added some samples to my own model and
was able to get a model that works well.


For those interested, the tokenizer usually gets passed data after the
sentence detector.  The tokenizer then splits each sentence at whitespace
and punctuation.  E.g., for the sentence
    This old house was painted white.
the training data would be
    This old house was painted white<SPLIT>.
This indicates that we want the period at the end of the sentence to be
split from the last word in the sentence.  Similar ideas hold for the comma
and quote characters.  Possessive nouns need special handling:
James' would be James<SPLIT>' and John's would be John<SPLIT>'s ...
Note that words in the sentence that are already separated by spaces don't
need a <SPLIT>.
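
To make the <SPLIT> convention concrete, here is a minimal sketch in
plain Python (not OpenNLP code) of the tokens a training line asks for.
The function name tokens_from_training_line is made up for this
illustration, and the quoted sample line is my assumption of how the
"mistakes" sentence from the original question could be annotated.

```python
SPLIT = "<SPLIT>"

def tokens_from_training_line(line):
    """Return the tokens implied by one line of <SPLIT>-annotated text.

    Hypothetical helper for illustration only; it is not part of OpenNLP.
    Whitespace already separates tokens, so only the extra boundaries
    inside a whitespace-delimited chunk need a <SPLIT> marker.
    """
    tokens = []
    for chunk in line.split():
        # Cut each chunk at every marker and drop empty pieces.
        tokens.extend(part for part in chunk.split(SPLIT) if part)
    return tokens

# The example sentence from above.
print(tokens_from_training_line("This old house was painted white<SPLIT>."))

# Assumed annotation for the sentence from the original question.
print(tokens_from_training_line('The army had made "<SPLIT>mistakes<SPLIT>"<SPLIT>.'))
```

With samples like the second line in the training data, the model sees
an opening quote split off from a lowercase word, which is the case the
stock model appears to miss.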

James
