Hello OpenNLPists,

We have trained a Word Tokenizer model for French on our own data and see
weird cases where splitting occurs in the middle of a word, like this:

Portsmouth --> Ports mouth

This word comes from the testing corpus, which is normal French text found
on the web, though the word itself is not French.
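For concreteness, this is roughly how we load and run the model (the model
path is a placeholder):

    import java.io.FileInputStream;
    import java.util.Arrays;

    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TokenizeExample {
        public static void main(String[] args) throws Exception {
            // Load our trained French tokenizer model
            // ("fr-token.bin" is a placeholder path).
            TokenizerModel model =
                    new TokenizerModel(new FileInputStream("fr-token.bin"));
            TokenizerME tokenizer = new TokenizerME(model);

            // With our model this prints [Ports, mouth]
            // instead of [Portsmouth].
            System.out.println(
                    Arrays.toString(tokenizer.tokenize("Portsmouth")));
        }
    }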

I wonder why the word tokenizer attempts to split *between* two alphabetic
characters at all. I can imagine cases where splitting inside a word is
useful, for example with proclitics and enclitics, but I would rather handle
those in a separate step and have the word tokenizer split only at
punctuation marks. Is this configurable in OpenNLP?
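From skimming the Javadoc, I suspect the useAlphaNumericOptimization flag on
TokenizerFactory might be what I am after: if I understand it correctly,
spans matching the alphanumeric pattern are kept as single tokens and never
offered to the classifier as split candidates. Here is a minimal sketch of
the training setup I have in mind (file names are placeholders; please
correct me if this is not how the flag is meant to be used):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerFactory;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainFrenchTokenizer {
        public static void main(String[] args) throws Exception {
            // Training data in the one-sentence-per-line format with
            // <SPLIT> markers ("fr-tok.train" is a placeholder path).
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("fr-tok.train")),
                    StandardCharsets.UTF_8);
            ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

            // useAlphaNumericOptimization = true: purely alphanumeric spans
            // are not considered as split points, so "Portsmouth" should
            // stay whole.
            TokenizerFactory factory = new TokenizerFactory(
                    "fr",
                    null,  // no abbreviation dictionary
                    true,  // useAlphaNumericOptimization
                    null); // default alphanumeric pattern

            TokenizerModel model = TokenizerME.train(
                    samples, factory, TrainingParameters.defaultParams());
            model.serialize(new FileOutputStream("fr-token.bin"));
        }
    }

On the command line, I believe the same switch is exposed as the
-alphaNumOpt option of the TokenizerTrainer tool, but I may be wrong about
that. Is this the intended way to restrict splits to punctuation?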

Best regards,
Nikolai
