Hello OpenNLPists,

We have trained a word tokenizer model for French on our own data and see odd cases where a split occurs in the middle of a word, for example:

Portsmouth --> Ports mouth

The word comes from our testing corpus, which is ordinary French text collected from the web, although the word itself is not French. Why would the word tokenizer attempt to split *between* two alphabetic characters? I can imagine cases where splitting inside a word is genuinely useful, e.g. for proclitics and enclitics, but I would rather handle those in a separate step and have the word tokenizer split only at punctuation marks. Is this somehow configurable in OpenNLP?
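In case it helps, here is roughly how we train and apply the model. This is a simplified sketch against the OpenNLP 1.x TokenizerME API; the file name, the useAlphaNumericOptimization value, the training parameters, and the test sentence are illustrative rather than our exact setup.

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainFrenchTokenizer {

    public static void main(String[] args) throws Exception {
        // Training data: one sentence per line, non-whitespace token
        // boundaries marked with <SPLIT> (the default TokenSample format).
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("fr-tok.train")),
                StandardCharsets.UTF_8);

        try (ObjectStream<TokenSample> samples = new TokenSampleStream(lines)) {
            // The third argument is useAlphaNumericOptimization; we pass
            // false here -- is setting it to true the intended way to keep
            // the tokenizer from splitting between two alphabetic characters?
            TokenizerFactory factory = new TokenizerFactory("fr", null, false, null);

            TokenizerModel model =
                    TokenizerME.train(samples, factory, TrainingParameters.defaultParams());

            TokenizerME tokenizer = new TokenizerME(model);
            // Illustrative sentence; with our model this prints "Ports" and
            // "mouth" as separate tokens.
            for (String token : tokenizer.tokenize("Portsmouth est une ville anglaise.")) {
                System.out.println(token);
            }
        }
    }
}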
Best regards,
Nikolai
