On 3/3/11 4:33 PM, Rohana Rajapakse wrote:
Thanks. I have created the training files (conll03 + Reuters) and trained the models, using the
latin-detokenizer that came with the download. The trained model solves the double-quotation
problem (e.g. "mistakes" is now split into three tokens: the opening ", mistakes, and the closing ").
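If I am reading the shipped latin-detokenizer.xml correctly, the double quote is handled by a
single entry, roughly like this (RIGHT_LEFT_MATCHING is one of the operations in OpenNLP's
DetokenizationDictionary; exact entry syntax as I understand the file):

  <!-- alternately attach the quote to the following (opening) and preceding (closing) token -->
  <entry operation="RIGHT_LEFT_MATCHING">
    <token>"</token>
  </entry>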
I have tried adding the same detokenizer rule for the single quote. However, it conflicts with the
other uses of the single quote (e.g. possessives and contractions such as Tom's, it's, etc.), which
means we will have to handle those cases separately. I will try adding <SPLIT> tags for them
(e.g. Tom<SPLIT>'s, it<SPLIT>'s, etc.). I don't know which takes priority, the rules in the
detokenizer or the <SPLIT> tags...
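That is, lines in the tokenizer training data roughly like the following (made-up sentence; one
sentence per line, with <SPLIT> marking token boundaries that have no surrounding whitespace):

  It<SPLIT>'s Tom<SPLIT>'s book<SPLIT>.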
Yes, you need to add all the tokens that should be attached to the
previous one, such as "'s", "'t", etc.
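Assuming the same XML format as latin-detokenizer.xml, such entries would look roughly like
this (MOVE_LEFT should attach the token to the preceding one):

  <entry operation="MOVE_LEFT">
    <token>'s</token>
  </entry>
  <entry operation="MOVE_LEFT">
    <token>'t</token>
  </entry>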
It would be nice to have such a file as part of the project.
Jörn