Hi,

Perhaps you already know this, but just in case.
Tokenizing/detokenizing is heavily used in statistical machine
translation to prepare training/evaluation data. There are rule-based
detokenizers developed by the Moses developers, including a German
detokenizer (the Moses research leader is German). I cannot say how
well it works for German, and perhaps the Apache OpenNLP detokenizer
works better, but I have used the Moses scripts heavily for English
and Spanish, and they work fine in those languages.

https://github.com/moses-smt/mosesdecoder/tree/master/scripts

Look in the tokenizer and share folders; the detokenizer script is
tokenizer/detokenizer.perl.
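
If it helps, the script runs as a stdin/stdout filter, with -l
selecting the language rules. Something like this (the file names
here are just placeholders):

    perl scripts/tokenizer/detokenizer.perl -l de < tokenized.de > detokenized.de

The same script handles English and Spanish with -l en and -l es.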

Cheers,

Rodrigo

On Fri, Mar 15, 2013 at 9:44 AM, Jörn Kottmann <[email protected]> wrote:
> On 03/15/2013 02:42 AM, James Kosin wrote:
>>
>>
>> Here, each token is separated by a space in the final output. What you
>> seem to have is data that is already tokenized, and you are trying to
>> generate a training file from that data. It isn't impossible, but
>> nothing you do can perfectly reconstruct the original without the
>> original data.
>>
>> There are some rules that do work, but not always.
>
>
> We have historically always done it like that, because all the corpora
> we trained on contain only tokenized text, which therefore needs to be
> detokenized somehow to produce training data for the tokenizer.
>
> Jörn