> This is also a reason to turn Unicode normalization on.  If the
> tokenizer did NFKC at the beginning, then the problem would go away.

If I understand the situation correctly, NFKC would only fix this particular
example and a few others like it. Many base+combining grapheme clusters in
Unicode text cannot be normalized to a single precomposed character.
Vietnamese comes to mind.
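
A rough sketch of what I mean, using Python's standard unicodedata module
rather than the Moses tokenizer itself. The helper name and the "q" +
COMBINING DIAERESIS test string are my own illustration; I am assuming that
sequence has no precomposed code point, which matches what I see locally.

```python
import unicodedata

def nfkc_leftover_marks(text):
    """Return the names of combining marks that survive NFKC normalization."""
    normalized = unicodedata.normalize("NFKC", text)
    # General categories Mn, Mc and Me are the Unicode "Mark" classes.
    return [
        unicodedata.name(ch)
        for ch in normalized
        if unicodedata.category(ch).startswith("M")
    ]

# u + COMBINING DIAERESIS, as in the reported line: composes to U+00FC.
juergen = "Ju\u0308rgen"
# q + COMBINING DIAERESIS: hypothetical example assumed to have no
# precomposed form, so the mark stays a separate code point after NFKC.
no_precomposed = "q\u0308"

print(nfkc_leftover_marks(juergen))
print(nfkc_leftover_marks(no_precomposed))
```

If the second call still reports a combining mark, a tokenizer that splits
on non-alphanumeric characters would presumably still break that cluster
apart, normalization or not.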

- JB

On Dec 29, 2014, at 16:05, Kenneth Heafield <mo...@kheafield.com> wrote:

> Dear Moses,
> 
>       The attached file, taken from line 2345157 of
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
> , tokenizes differently on different machines.
> 
>       I'm running tokenizer.perl from head (481a07dc) with this perl:
> 
> This is perl 5, version 18, subversion 2 (v5.18.2) built for
> x86_64-linux-thread-multi
> (with 25 registered patches, see perl -V for more detail)
> 
> perl -V is attached from newer machines.
> 
>       The input is "Jürgen" with a specific encoding:
> 
> uconv -f utf-8 -x any-name jur
> 
> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
> LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
> 
> So the umlaut is encoded as a normal "u" character followed by a
> combining diaeresis marker.  This encoding is legal, but it differs from
> the single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
> DIAERESIS}.
> 
> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
> DIAERESIS} as a single character and recognizing it as part of the
> IsAlnum class.  Tokenizing on these machines outputs
> 
> Jürgen
> 
> Newer machines are treating them separately, recognizing \N{COMBINING
> DIAERESIS} as a separate character that is not part of IsAlnum.  The
> Moses tokenizer then treats it as something to split off, yielding this
> tokenization:
> 
> Ju ̈ rgen
> 
> I thought it might be locale-related but IsAlnum is supposed to be
> locale-agnostic.  I couldn't come up with environment variables that
> made the new machines tokenize as a single word.
> 
> Maybe this is a perl bug, but the result is that two different machines
> running the same perl script produce different tokenization :-(.
> 
> This is also a reason to turn Unicode normalization on.  If the
> tokenizer did NFKC at the beginning, then the problem would go away.
> 
> Kenneth
> 
> <jur.gz><perl_V.txt>_______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support

