Dear list, I created a tokenizer model with 300k German sentences from a very clean corpus. I see some words that are split very strangely by a tokenizer using this model, for example:
stehenge - blieben, fre - undlicher, and so on. I can't find these splits in my training data and wonder why OpenNLP splits these words when there is no evidence for it in the training data and no whitespace inside them in my text files. I trained the model with 500 iterations, cutoff 5, and alphanumeric optimisation. Does anyone have ideas on how I can prevent this?

Thank you,
Andreas

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig
mail: [email protected]
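P.S. For reference, a minimal sketch of how a training run with these parameters might look, assuming the TokenizerFactory-based API of OpenNLP 1.6+ (the file names and wrapper class are illustrative placeholders, not the original setup):

import java.io.*;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainGermanTokenizer {
    public static void main(String[] args) throws IOException {
        // Training data: one sentence per line, token boundaries marked with <SPLIT>
        // wherever a token boundary has no surrounding whitespace
        InputStreamFactory in = new MarkableFileInputStreamFactory(new File("de-tok.train"));
        ObjectStream<TokenSample> samples =
                new TokenSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, "500"); // 500 iterations
        params.put(TrainingParameters.CUTOFF_PARAM, "5");       // cutoff 5

        // useAlphaNumericOptimization = true: tokens matching the alphanumeric
        // pattern are not considered for further splitting by the model
        TokenizerFactory factory = new TokenizerFactory("de", null, true, null);

        TokenizerModel model = TokenizerME.train(samples, factory, params);

        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("de-token.bin"))) {
            model.serialize(out);
        }
    }
}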
