Hello,

> We probably need to fix the detokenizer rules used for the German models
> a bit to handle these cases correctly.

Are those rules public somewhere so that I can edit them myself? I could
provide them to the community afterwards. Mainly, characters like „“ are
not recognized by the tokenizer. I don't want to convert them before
tokenizing, because we analyze things like direct speech and those
characters are a good indicator for that.
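To illustrate what I mean, here is a minimal sketch (plain Python, not the
toolkit's actual rules; the function names and regex are just illustrative)
of keeping „“ as standalone tokens and then using them as direct-speech
markers:

import re

def tokenize_keep_quotes(text):
    # Split off „ and “ as separate tokens instead of losing them.
    text = re.sub(r'([„“])', r' \1 ', text)
    return text.split()

def direct_speech_spans(tokens):
    # Yield token spans enclosed in „ ... “ as candidate direct speech.
    spans, start = [], None
    for i, tok in enumerate(tokens):
        if tok == '„':
            start = i + 1
        elif tok == '“' and start is not None:
            spans.append(tokens[start:i])
            start = None
    return spans

tokens = tokenize_keep_quotes('Er sagte: „Das ist ein Test.“')
print(direct_speech_spans(tokens))  # [['Das', 'ist', 'ein', 'Test.']]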


> I suggest to use our detokenizer to turn your tokenized text into
> training data.

Does the detokenizer have a command-line tool as well?

Thank you all

Andreas

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: [email protected]
