Dear list, I created a tokenizer model with 300k German sentences from a very clean corpus. I see some words that are split very strangely by a tokenizer using this model, for example:
stehenge - blieben, fre - undlicher, and so on. I can't find these splits in my training data and wonder why OpenNLP splits these words when there is no evidence for it in the training data and no whitespace inside them in my text files. I trained the model with 500 iterations, cutoff 5, and alphanumeric optimisation. Does anyone have ideas on how I can prevent this?

Thank you,
Andreas

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig
mail: [email protected]
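P.S. For reference, a minimal sketch of how a training run with these parameters might look, assuming the TokenizerFactory-based API of OpenNLP 1.6+ (the file names and wrapper class are illustrative placeholders, not the original setup):

import java.io.*;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainGermanTokenizer {
    public static void main(String[] args) throws IOException {
        // Training data: one sentence per line, token boundaries marked with <SPLIT>
        // wherever a token boundary has no surrounding whitespace
        InputStreamFactory in = new MarkableFileInputStreamFactory(new File("de-tok.train"));
        ObjectStream<TokenSample> samples =
                new TokenSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, "500"); // 500 iterations
        params.put(TrainingParameters.CUTOFF_PARAM, "5");       // cutoff 5

        // useAlphaNumericOptimization = true: tokens matching the alphanumeric
        // pattern are not considered for further splitting by the model
        TokenizerFactory factory = new TokenizerFactory("de", null, true, null);

        TokenizerModel model = TokenizerME.train(samples, factory, params);

        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("de-token.bin"))) {
            model.serialize(out);
        }
    }
}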
