Thank you. What I don't understand is how this produces valid training data, since I just delete whitespace. You said that I need to include some <SPLIT> tags to have proper training data. Can you please comment on why we end up with proper training data after detokenizing? I hope it's OK to ask all these questions, but I really want to understand OpenNLP tokenization.
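For example, if the tokenizer training format is the one described in the manual (one sentence per line, with a <SPLIT> tag at every token boundary that has no whitespace), then a training sentence should look like

    Er sagte<SPLIT>, dass es regnet<SPLIT>.

but if I simply delete the whitespace in front of "," and "." I only get

    Er sagte, dass es regnet.

and then there is no marker left that tells the trainer where it has to split. Maybe I misunderstand what the detokenizer actually outputs here.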
Thank you very much, and I will create a detokenizer dictionary based on all relevant special characters contained in my 300k dataset (a first guess at what such an entry could look like is in the P.S. below, after the quoted mail).

Andreas

On 14.03.2013 11:34, Jörn Kottmann wrote:
> On 03/14/2013 11:27 AM, Andreas Niekler wrote:
>> Hello,
>>
>>> We probably need to fix the detokenizer rules used for the German models
>>> a bit to handle these cases correctly.
>> Are those rules public somewhere so that I can edit them myself? I can
>> provide them to the community afterwards. Mostly characters like „“ are
>> not recognized by the tokenizer. I don't want to convert them before
>> tokenizing because we analyze things like direct speech and those
>> characters are a good indicator for that.
>
> No, for the German models I wrote some code to do the detokenization,
> which supported a specific corpus. Anyway, this work then led me to
> contribute the detokenizer to OpenNLP.
>
> There is one file for English:
> https://github.com/apache/opennlp/tree/trunk/opennlp-tools/lang/en/tokenizer
>
> We would be happy to receive a contribution for German.
> Have a look at the documentation; there is a section about the detokenizer.
>
>>> I suggest using our detokenizer to turn your tokenized text into
>>> training data.
>> Does the detokenizer have a command line tool as well?
>
> Yes, there is one. Have a look at the CLI help.
>
> Jörn

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: [email protected]
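P.S.: Here is my first guess at what entries for the German quotation marks could look like, assuming the German dictionary uses the same XML format as the English en-detokenizer.xml (I have not checked the exact format yet, so please correct me if it differs). The idea would be that the opening quote „ attaches to the token on its right and the closing quote “ to the token on its left:

    <dictionary>
      <entry operation="MOVE_RIGHT">
        <token>„</token>
      </entry>
      <entry operation="MOVE_LEFT">
        <token>“</token>
      </entry>
    </dictionary>

I would then run the detokenizer command line tool (DictionaryDetokenizer, if that is the right tool name) with this dictionary over my tokenized 300k dataset.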
