Hello,
> If you want to tokenize based on white spaces I suggest to use our white
> space tokenizer.

No, I do not want to tokenize on whitespace. I found out that the de-token.bin model isn't capable of separating things like direct speech in texts like 'Er sagte, dass "die neue'. That ends up with the token "die (quote character still attached).

So I took a clean 300k-sentence sample from our German reference corpus, which comes as whitespace-separated tokens, one sentence per line. I fed that file directly to the TokenizerTrainer tool and ended up with an exception because only 1 feature was found. I then added all the <SPLIT> tags as shown in the documentation, and the training terminated without an error, but the resulting model still produces the undesired errors described above.

So I surely need a model-based tokenizer, because I also want punctuation and the like to be split off. The only thing I wasn't able to do is train directly on whitespace-separated sentences.

Thanks for your help

Andreas

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig
mail: [email protected]
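
P.S. In case it helps to see what I mean by adding the <SPLIT> tags: my conversion was roughly along the lines of the sketch below. This is only a rough sketch, not my exact code; the class name and the heuristic (punctuation attaches to the preceding token, an opening quote or bracket to the following token) are simplifications.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class SplitTagConverter {

        // simplification: these characters are assumed to attach to the preceding token ...
        private static final String ATTACH_LEFT = ".,;:!?)";
        // ... and these to the following token (i.e. no whitespace in the original running text)
        private static final String ATTACH_RIGHT = "(\"";

        public static void main(String[] args) throws IOException {
            Path in = Paths.get(args[0]);   // input: one sentence per line, tokens separated by blanks
            Path out = Paths.get(args[1]);  // output: tokenizer training data with <SPLIT> tags
            try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
                 BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] tokens = line.trim().split("\\s+");
                    StringBuilder sb = new StringBuilder();
                    for (int i = 0; i < tokens.length; i++) {
                        if (i > 0) {
                            boolean glueLeft = ATTACH_LEFT.indexOf(tokens[i].charAt(0)) >= 0;
                            boolean glueRight = ATTACH_RIGHT.indexOf(
                                    tokens[i - 1].charAt(tokens[i - 1].length() - 1)) >= 0;
                            // no whitespace in the running text -> the tokenizer has to learn this split
                            sb.append(glueLeft || glueRight ? "<SPLIT>" : " ");
                        }
                        sb.append(tokens[i]);
                    }
                    writer.write(sb.toString());
                    writer.newLine();
                }
            }
        }
    }

With that, a line like

    Er sagte , dass " die neue

becomes

    Er sagte<SPLIT>, dass "<SPLIT>die neue

which, as far as I understand the documentation, is the form the TokenizerTrainer expects.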
