Re: [Moses-support] German tokenizer may fail with numeric endings

2018-11-07 Thread Ergun Bicici
Could there be a workaround with "?!$" added to the prefixes/suffixes so
that if not at the end of a sentence, they will not be split?
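
One possible reading of "?!$" is the regex negative lookahead `(?!$)`: a listed prefix keeps its period only when the line continues after it. The sketch below is a minimal Python illustration under that assumption. The function name and the prefix sets are hypothetical, and this is not the logic used in tokenizer.perl.

```python
import re


def split_periods(line, prefixes):
    """Split word-final periods, protecting listed prefixes mid-line only."""
    alternation = "|".join(re.escape(p) for p in prefixes)
    pattern = re.compile(
        r"(?<!\S)"                                   # start of a word
        r"(?!(?:" + alternation + r")\.(?!$|\S))"    # skip a prefix that is not line-final
        r"(\S+?)\.(?=\s|$)"                          # a word ending in a period
    )
    # Insert a space before the period of every unprotected word.
    return pattern.sub(r"\1 .", line.strip())


print(split_periods("am 3. November kam er.", {"3"}))
print(split_periods("die Änderungsanträge 2 und 3.", {"2", "3"}))
```

With this pattern the mid-sentence "3." stays intact, while a line-final "3." is split into "3 .".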

Ergun

On Wed, Nov 7, 2018 at 1:20 PM Ozan Çağlayan wrote:

> Hi Hieu,
>
> Here it is, with some test cases in the commit message:
> https://github.com/moses-smt/mosesdecoder/pull/204
>
> Thanks.
>


-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] German tokenizer may fail with numeric endings

2018-11-07 Thread Ozan Çağlayan
Hi Hieu,

Here it is, with some test cases in the commit message:
https://github.com/moses-smt/mosesdecoder/pull/204

Thanks.


Re: [Moses-support] German tokenizer may fail with numeric endings

2018-11-07 Thread Hieu Hoang
I think you have a point. If you change tokenizer.perl to avoid applying
the non-breaking prefix rule to the last word, please send me the change.

Hieu Hoang
Sent while bumping into things

On Tue, 6 Nov 2018, 10:46 pm Ozan Çağlayan wrote:

> Yes, the rules come from the nonbreaking_prefixes files, which are
> text files listing the prefixes that, when followed by a period, should
> not be tokenized. But I think this rule should not be applied if the prefix is
> actually a suffix of the sentence. Similar situations arise for French and
> other languages as well. For French, "sec." is a non-breaking prefix, the
> abbreviation for "seconds", but "sec" also means "dry". So if a
> sentence ends with the "dry" meaning of "sec.", the period is also not
> tokenized.
>
> When the size of the corpora goes to infinity, this means that all
> nonbreaking_prefixes for a language will end up in the model vocabulary for
> NMT.


Re: [Moses-support] German tokenizer may fail with numeric endings

2018-11-06 Thread Ergun Bicici
Funny part is trying all 1-99 :)

"The prefix is actually a suffix of the sentence": this need not be true,
since there can be itemized lists, e.g. "1. one microsoft way from 9 to 1."
Such sentences are frequent in Europarl.
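
To make the itemized-list case concrete, here is a small Python sketch of what a "skip the prefix rule for the last word" exception would do to that example. The prefix set (1 to 99) and the function name are assumptions for illustration only, not the contents of the real prefix file or the tokenizer.perl code.

```python
# Assumed stand-in for the ordinal entries of a prefix file.
ORDINALS = {str(n) for n in range(1, 100)}


def tokenize_periods(line, prefixes=ORDINALS):
    """Keep 'prefix.' together, except when it is the last word of the line."""
    words = line.split()
    tokens = []
    for index, word in enumerate(words):
        has_period = word.endswith(".") and len(word) > 1
        stem = word[:-1] if has_period else word
        is_last = index == len(words) - 1
        if has_period and (stem not in prefixes or is_last):
            tokens.extend([stem, "."])   # split the period off
        else:
            tokens.append(word)          # leave the word untouched
    return " ".join(tokens)


print(tokenize_periods("1. one microsoft way from 9 to 1."))
print(tokenize_periods("1."))
```

Under these assumptions the leading list marker "1." is kept and only the final "1." is split. A line that consists of nothing but the marker "1." is split as well, which is the kind of case where the exception could misfire.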

On Wed, Nov 7, 2018 at 1:46 AM Ozan Çağlayan wrote:

> Yes, the rules come from the nonbreaking_prefixes files, which are
> text files listing the prefixes that, when followed by a period, should
> not be tokenized. But I think this rule should not be applied if the prefix is
> actually a suffix of the sentence. Similar situations arise for French and
> other languages as well. For French, "sec." is a non-breaking prefix, the
> abbreviation for "seconds", but "sec" also means "dry". So if a
> sentence ends with the "dry" meaning of "sec.", the period is also not
> tokenized.
>
> When the size of the corpora goes to infinity, this means that all
> nonbreaking_prefixes for a language will end up in the model vocabulary for
> NMT.
>


-- 

Regards,
Ergun


Re: [Moses-support] German tokenizer may fail with numeric endings

2018-11-06 Thread Ozan Çağlayan
Yes, the rules come from the nonbreaking_prefixes files, which are text
files listing the prefixes that, when followed by a period, should not be
tokenized. But I think this rule should not be applied if the prefix is
actually a suffix of the sentence. Similar situations arise for French and
other languages as well. For French, "sec." is a non-breaking prefix, the
abbreviation for "seconds", but "sec" also means "dry". So if a
sentence ends with the "dry" meaning of "sec.", the period is also not
tokenized.

When the size of the corpora goes to infinity, this means that all
nonbreaking_prefixes for a language will end up in the model vocabulary for
NMT.
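
The proposed exception can be sketched as follows. This is a simplified Python illustration, not the tokenizer.perl implementation: the function name, the `exempt_last_word` flag, and the one-entry French prefix set are all hypothetical.

```python
def split_final_periods(sentence, prefixes, exempt_last_word=True):
    """Split word-final periods unless the word is a listed prefix.

    With exempt_last_word=True the prefix rule is skipped for the last
    word of the sentence, so its period is always split off.
    """
    words = sentence.split()
    tokens = []
    for index, word in enumerate(words):
        if not (word.endswith(".") and len(word) > 1):
            tokens.append(word)
            continue
        stem = word[:-1]
        is_last = index == len(words) - 1
        protected = stem in prefixes and not (exempt_last_word and is_last)
        if protected:
            tokens.append(word)
        else:
            tokens.extend([stem, "."])
    return " ".join(tokens)


FRENCH_PREFIXES = {"sec"}  # hypothetical one-entry list for illustration

print(split_final_periods("Il attend 30 sec. puis repart.", FRENCH_PREFIXES))
print(split_final_periods("Ce vin est sec.", FRENCH_PREFIXES))
print(split_final_periods("Ce vin est sec.", FRENCH_PREFIXES, exempt_last_word=False))
```

With the exception enabled, "sec." is kept mid-sentence and split at the end of the sentence; with it disabled, the sentence-final "sec." stays unsplit, which matches the behaviour described above. The trade-off is that a sentence that really ends with the abbreviation would be split too.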


Re: [Moses-support] German tokenizer may fail with numeric endings

2018-11-06 Thread Ergun Bicici
There might be some rule that prevents the split. The scripts contain
language-specific tokenization rules, and they are checked in sequence.

Did you try all 1-99? :)

On Mon, Nov 5, 2018 at 9:15 PM Ozan Çağlayan wrote:

> Hello,
>
> I just discovered that the German tokenizer does not split the final period
> if it is preceded by a number. This is because of the nonbreaking prefixes
> file, which lists ordinals in the form '<number>.'. Since the list only
> covers 1-99, the tokenizer works correctly for numbers > 99. Here's a
> sentence from Europarl:
>
> $ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 3.' | tokenizer.perl -q -l de
> Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 3.
>
> $ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 100.' | tokenizer.perl -q -l de
> Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 100 .
>
>
>
> --
> Ozan Caglayan
> PhD student @ University of Le Mans
> Team LST -- Language and Speech Technology
> http://www.ozancaglayan.com
>


-- 

Regards,
Ergun


[Moses-support] German tokenizer may fail with numeric endings

2018-11-05 Thread Ozan Çağlayan
Hello,

I just discovered that the German tokenizer does not split the final period
if it is preceded by a number. This is because of the nonbreaking prefixes
file, which lists ordinals in the form '<number>.'. Since the list only
covers 1-99, the tokenizer works correctly for numbers > 99. Here's a
sentence from Europarl:

$ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 3.' | tokenizer.perl -q -l de
Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 3.

$ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 und 100.' | tokenizer.perl -q -l de
Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die Änderungsanträge 2 and 100 .
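
The two outputs are consistent with a rule that leaves the period attached whenever the preceding token appears in the prefix list. A simplified Python sketch of such a rule follows. It assumes the list contains the numbers 1 to 99 as described above; the names are hypothetical and the real tokenizer.perl applies more rules than this.

```python
# Assumed stand-in for the ordinal entries of the German prefix file.
NONBREAKING_PREFIXES = {str(n) for n in range(1, 100)}


def split_periods(sentence):
    """Split a word-final period unless the word is a listed prefix."""
    tokens = []
    for word in sentence.split():
        if word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            if stem in NONBREAKING_PREFIXES:
                tokens.append(word)          # "3." is treated as an ordinal
            else:
                tokens.extend([stem, "."])   # "100." becomes "100 ."
        else:
            tokens.append(word)
    return " ".join(tokens)


print(split_periods("voll die Änderungsanträge 2 und 3."))
print(split_periods("voll die Änderungsanträge 2 und 100."))
```

In this sketch "3." is kept as one token because "3" is in the list, while "100." is split because "100" is not.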



-- 
Ozan Caglayan
PhD student @ University of Le Mans
Team LST -- Language and Speech Technology
http://www.ozancaglayan.com