Re: [Moses-support] German tokenizer may fail with numeric endings
Could there be a workaround with "?!$" added to the prefixes/suffixes, so that they are not split unless they are at the end of a sentence?

On Wed, Nov 7, 2018 at 1:20 PM Ozan Çağlayan wrote:
> Hi Hieu,
>
> Here it is, with some test cases in the commit message:
> https://github.com/moses-smt/mosesdecoder/pull/204
>
> Thanks.

--
Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
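A rough illustration of that idea, reading "?!$" as the regex negative lookahead (?!$), which is an assumption about what was meant. The prefix set, the pattern names and split_periods are hypothetical and written in Python for brevity; this is not code from tokenizer.perl, and it ignores every other rule the real script applies. The input is assumed to be a single sentence with no trailing whitespace.

```python
import re

# Hypothetical stand-in for a nonbreaking_prefixes list: ordinals 1-99 plus "sec".
PREFIXES = {str(n) for n in range(1, 100)} | {"sec"}
ALTS = "|".join(re.escape(p) for p in sorted(PREFIXES))

# A token is protected when it is a listed prefix plus a period that is NOT at
# the end of the string: (?!$) fails only there. (?!\S) checks that the token ends.
PROTECTED = rf"(?:{ALTS})\.(?!$)(?!\S)"

# Split the trailing period of every whitespace-delimited token that is not protected.
SPLIT = re.compile(rf"(?<!\S)(?!{PROTECTED})(\S+)\.(?!\S)")


def split_periods(sentence: str) -> str:
    """Detach trailing periods, except after a listed prefix in mid-sentence."""
    return SPLIT.sub(r"\1 .", sentence)


print(split_periods("1. one microsoft way from 9 to 1."))
```

With this pattern the leading "1." of an itemized list stays intact, while the same token at the very end of the sentence loses its period.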
Re: [Moses-support] German tokenizer may fail with numeric endings
Hi Hieu,

Here it is, with some test cases in the commit message:
https://github.com/moses-smt/mosesdecoder/pull/204

Thanks.
Re: [Moses-support] German tokenizer may fail with numeric endings
I think you have a point. If you change tokenizer.perl so that it does not apply the non-breaking prefix rule to the last word, please send me the change.

Hieu Hoang
Sent while bumping into things

On Tue, 6 Nov 2018, 10:46 pm Ozan Çağlayan wrote:
> But I think this rule should not be applied if the prefix is
> actually a suffix of the sentence.
Re: [Moses-support] German tokenizer may fail with numeric endings
The funny part is trying all of 1-99 :)

On Wed, Nov 7, 2018 at 1:46 AM Ozan Çağlayan wrote:
> But I think this rule should not be applied if the prefix is
> actually a suffix of the sentence.

This need not be true, since there can be itemized lists: "1. one microsoft way from 9 to 1." Such sentences occur frequently in Europarl.

--
Regards,
Ergun
Re: [Moses-support] German tokenizer may fail with numeric endings
Yes, the rules come from the nonbreaking_prefixes files, which are text files listing the prefixes that, when followed by a period, should not be split from it. But I think this rule should not be applied if the prefix is actually a suffix of the sentence. Similar situations arise for French and other languages as well. For French, "sec." is a non-breaking prefix, the abbreviation for "seconds", but "sec" also means "dry". So if a sentence ends with the "dry" meaning of "sec.", the period is not tokenized either.

When the size of the corpora goes to infinity, this means that all nonbreaking_prefixes for a language will end up in the model vocabulary for NMT with the period attached.
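The proposed exception can be sketched as follows. This is a simplified Python model for illustration, not the logic of tokenizer.perl: the function name, the apply_to_last_word flag and the one-entry French prefix set are all hypothetical, and the real script applies further rules that are left out here.

```python
def split_periods(sentence: str, prefixes: set, apply_to_last_word: bool = True) -> str:
    """Detach a trailing period from each word unless the word is a listed prefix.

    apply_to_last_word=True models the behaviour described in this thread;
    False models the proposed exception for the sentence-final word.
    """
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if len(word) > 1 and word.endswith("."):
            pre = word[:-1]
            if pre in prefixes and (apply_to_last_word or not is_last):
                out.append(word)          # keep "sec." as one token
            else:
                out.extend([pre, "."])    # split the period off
        else:
            out.append(word)
    return " ".join(out)


FRENCH_PREFIXES = {"sec"}  # hypothetical excerpt of a French prefix list

print(split_periods("Le vin est sec.", FRENCH_PREFIXES))
print(split_periods("Le vin est sec.", FRENCH_PREFIXES, apply_to_last_word=False))
```

With the exception enabled, "sec." in mid-sentence is still protected, and only the sentence-final occurrence gets its period split off.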
Re: [Moses-support] German tokenizer may fail with numeric endings
There might be some rule that prevents the split. The scripts contain language-specific tokenization rules, and these are checked in sequence. Did you try all 1-99? :)

On Mon, Nov 5, 2018 at 9:15 PM Ozan Çağlayan wrote:
> I just discovered that the German tokenizer does not split the final
> period if preceded by a number.

--
Regards,
Ergun
[Moses-support] German tokenizer may fail with numeric endings
Hello,

I just discovered that the German tokenizer does not split the final period if it is preceded by a number. This is because of the nonbreaking prefixes file, which lists ordinals in the form '<number>.'. Since the list covers 1-99, the tokenizer works correctly for numbers > 99. Here's a sentence from Europarl:

$ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 3.' | tokenizer.perl -q -l de
Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 3.

$ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 100.' | tokenizer.perl -q -l de
Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 100 .

--
Ozan Caglayan
PhD student @ University of Le Mans
Team LST -- Language and Speech Technology
http://www.ozancaglayan.com
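A minimal model of why "3." and "100." behave differently, assuming the prefix list contains exactly the numbers 1-99 as described above. It is written in Python for illustration only. NONBREAKING_PREFIXES and split_final_periods are hypothetical names, and the real tokenizer.perl has additional rules that this sketch leaves out.

```python
# Hypothetical stand-in for the German nonbreaking_prefixes entries: ordinals 1-99.
NONBREAKING_PREFIXES = {str(n) for n in range(1, 100)}


def split_final_periods(sentence: str) -> str:
    """Split a trailing period off each word unless the word is a listed prefix."""
    out = []
    for word in sentence.split():
        if len(word) > 1 and word.endswith("."):
            pre = word[:-1]
            if pre in NONBREAKING_PREFIXES:
                out.append(word)          # "3." is treated as an ordinal and kept
            else:
                out.extend([pre, "."])    # "100." is not listed, so it is split
        else:
            out.append(word)
    return " ".join(out)


print(split_final_periods("die Änderungsanträge 2 und 3."))
print(split_final_periods("die Änderungsanträge 2 und 100."))
```

The lookup does not consider where the word sits in the sentence, which is why a sentence-final "3." is handled like an ordinal.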