On 3/3/11 4:33 PM, Rohana Rajapakse wrote:
Thanks. I have created the training files (conll03 + Reuters) and trained the models, using the
latin-detokenizer that came with the download. The trained model solves the double-quotation
problem (e.g. "mistakes" is now split into three tokens: the opening ", mistakes, and the closing ").
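If I am reading the shipped latin-detokenizer.xml correctly, the double quote is handled by a
single entry, roughly like this (RIGHT_LEFT_MATCHING is one of the operations in OpenNLP's
DetokenizationDictionary; exact entry syntax as I understand the file):

  <!-- alternately attach the quote to the following (opening) and preceding (closing) token -->
  <entry operation="RIGHT_LEFT_MATCHING">
    <token>"</token>
  </entry>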
I have tried adding the same detokenizer rule for the single quote. However, it conflicts with the
other uses of the single quote (e.g. possessives and contractions such as Tom's, it's, etc.), which
means we will have to handle those cases separately. I will try adding <SPLIT> tags for them
(e.g. Tom<SPLIT>'s, it<SPLIT>'s, etc.). I don't know which takes priority, the rules in the
detokenizer or the <SPLIT> tags...
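That is, lines in the tokenizer training data roughly like the following (made-up sentence; one
sentence per line, with <SPLIT> marking token boundaries that have no surrounding whitespace):

  It<SPLIT>'s Tom<SPLIT>'s book<SPLIT>.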
Yes, you need to add all the tokens that should be attached to the
previous one, such as "'s", "'t", etc.
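Assuming the same XML format as latin-detokenizer.xml, such entries would look roughly like
this (MOVE_LEFT should attach the token to the preceding one):

  <entry operation="MOVE_LEFT">
    <token>'s</token>
  </entry>
  <entry operation="MOVE_LEFT">
    <token>'t</token>
  </entry>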
It would be nice to have such a file as part of the project.
Jörn