Hello,
> If you want to tokenize based on white spaces I suggest to use our white
> space tokenizer.

No, I do not want to tokenize on whitespace. I found out that the de-token.bin model isn't capable of separating things like direct speech in texts like 'Er sagte, dass "die neue'. That ends up with the token "die (quote character still attached).

So I took a clean 300k-sentence sample from our German reference corpus, which comes as whitespace-separated tokens, one sentence per line. I fed that file directly to the TokenizerTrainer tool and ended up with an exception because only 1 feature was found. I then added all the <SPLIT> tags as shown in the documentation, and the training terminated without an error, but the resulting model still produces the undesired errors described above.

So I surely need a model-based tokenizer, because I also want punctuation and the like to be split off. The only thing I wasn't able to do is train directly on whitespace-separated sentences.

Thanks for your help

Andreas

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig
mail: [email protected]
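
P.S. In case it helps to see what I mean by adding the <SPLIT> tags: my conversion was roughly along the lines of the sketch below. This is only a rough sketch, not my exact code; the class name and the heuristic (punctuation attaches to the preceding token, an opening quote or bracket to the following token) are simplifications.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class SplitTagConverter {

        // simplification: these characters are assumed to attach to the preceding token ...
        private static final String ATTACH_LEFT = ".,;:!?)";
        // ... and these to the following token (i.e. no whitespace in the original running text)
        private static final String ATTACH_RIGHT = "(\"";

        public static void main(String[] args) throws IOException {
            Path in = Paths.get(args[0]);   // input: one sentence per line, tokens separated by blanks
            Path out = Paths.get(args[1]);  // output: tokenizer training data with <SPLIT> tags
            try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
                 BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] tokens = line.trim().split("\\s+");
                    StringBuilder sb = new StringBuilder();
                    for (int i = 0; i < tokens.length; i++) {
                        if (i > 0) {
                            boolean glueLeft = ATTACH_LEFT.indexOf(tokens[i].charAt(0)) >= 0;
                            boolean glueRight = ATTACH_RIGHT.indexOf(
                                    tokens[i - 1].charAt(tokens[i - 1].length() - 1)) >= 0;
                            // no whitespace in the running text -> the tokenizer has to learn this split
                            sb.append(glueLeft || glueRight ? "<SPLIT>" : " ");
                        }
                        sb.append(tokens[i]);
                    }
                    writer.write(sb.toString());
                    writer.newLine();
                }
            }
        }
    }

With that, a line like

    Er sagte , dass " die neue

becomes

    Er sagte<SPLIT>, dass "<SPLIT>die neue

which, as far as I understand the documentation, is the form the TokenizerTrainer expects.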
