Hi,

Perhaps you already know this, but just in case: tokenizing/detokenizing is heavily used in statistical machine translation to prepare training and evaluation data. There are rule-based detokenizers developed by the Moses developers, including a German one (the lead Moses researcher is German). I cannot say how well it works for German; perhaps the Apache OpenNLP detokenizer works better. But I have used the Moses scripts heavily for English and Spanish, and they work fine for those languages.

https://github.com/moses-smt/mosesdecoder/tree/master/scripts

Look in the tokenizer/ and share/ folders.
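To give a rough idea of what these scripts do, here is a minimal sketch of rule-based detokenization in Python. The rule sets below are simplified assumptions of mine, not the actual Moses rules, which also handle quotes, contractions, and language-specific cases:

    # Punctuation that attaches to the token on its left.
    ATTACH_LEFT = {".", ",", ";", ":", "!", "?", ")", "]", "}", "%"}
    # Punctuation that attaches to the token on its right.
    ATTACH_RIGHT = {"(", "[", "{"}

    def detokenize(tokens):
        out = []
        for i, tok in enumerate(tokens):
            if i == 0:
                out.append(tok)
            elif tok in ATTACH_LEFT or tokens[i - 1] in ATTACH_RIGHT:
                out.append(tok)        # no space before this token
            else:
                out.append(" " + tok)  # normal token boundary
        return "".join(out)

    print(detokenize(["He", "said", "(", "quietly", ")", ":", "hello", "!"]))
    # -> He said (quietly): hello!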
Cheers,
Rodrigo

On Fri, Mar 15, 2013 at 9:44 AM, Jörn Kottmann <[email protected]> wrote:
> On 03/15/2013 02:42 AM, James Kosin wrote:
>>
>> Here, each token is separated by a space in the final output. What you
>> seem to have is data that is already tokenized, and you are trying to
>> generate a training file from that data. It isn't impossible, but
>> nothing you do can perfectly recover the original without the original
>> data.
>>
>> There are some rules that do work, but not always.
>
> We have historically always done it that way, because the corpora we
> train on only have tokenized text, which therefore needs to be
> detokenized somehow to produce training data for the tokenizer.
>
> Jörn
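For what it's worth, here is a minimal sketch of that workflow, again in Python with the same simplified rules as above (my own assumptions, not the actual Moses or OpenNLP rules): wherever the detokenization rules say "no space here", emit a <SPLIT> tag instead, which gives you OpenNLP-style tokenizer training data.

    ATTACH_LEFT = {".", ",", ";", ":", "!", "?", ")", "]", "}", "%"}
    ATTACH_RIGHT = {"(", "[", "{"}

    def to_training_line(tokens):
        parts = [tokens[0]]
        for prev, tok in zip(tokens, tokens[1:]):
            if tok in ATTACH_LEFT or prev in ATTACH_RIGHT:
                # token boundary with no whitespace in the raw text
                parts.append("<SPLIT>" + tok)
            else:
                parts.append(" " + tok)
        return "".join(parts)

    print(to_training_line(["He", "said", "hello", ",", "then", "left", "."]))
    # -> He said hello<SPLIT>, then left<SPLIT>.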
