> This is also a reason to turn Unicode normalization on. If the
> tokenizer did NFKC at the beginning, then the problem would go away.
If I understand the situation correctly, this would only fix this particular example and a few others like it. There are many base+combining grapheme clusters in Unicode text which cannot be normalized to a single pre-composed character. Vietnamese comes to mind.

- JB

On Dec 29, 2014, at 16:05, Kenneth Heafield <mo...@kheafield.com> wrote:

> Dear Moses,
>
> The attached file, taken from line 2345157 of
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
> , tokenizes differently on different machines.
>
> I'm running tokenizer.perl from head (481a07dc) with this perl:
>
> This is perl 5, version 18, subversion 2 (v5.18.2) built for
> x86_64-linux-thread-multi
> (with 25 registered patches, see perl -V for more detail)
>
> perl -V is attached from newer machines.
>
> The input is "Jürgen" with a specific encoding:
>
> uconv -f utf-8 -x any-name jur
>
> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
> LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
>
> So the umlaut is encoded as a normal "u" character followed by a
> combining diaeresis marker. This encoding is legal, but it differs from
> the single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
> DIAERESIS}.
>
> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
> DIAERESIS} as a single character and recognizing it as part of the
> IsAlnum class. Tokenizing on these machines outputs
>
> Jürgen
>
> Newer machines are treating them separately, recognizing \N{COMBINING
> DIAERESIS} as a separate character that is not part of IsAlnum. The
> Moses tokenizer then treats it as something to split off, yielding this
> tokenization:
>
> Ju ̈ rgen
>
> I thought it might be locale-related but IsAlnum is supposed to be
> locale-agnostic. I couldn't come up with environment variables that
> made the new machines tokenize as a single word.
>
> Maybe this is a perl bug, but the result is that two different machines
> running the same perl script produce different tokenization :-(.
>
> This is also a reason to turn Unicode normalization on. If the
> tokenizer did NFKC at the beginning, then the problem would go away.
>
> Kenneth
>
> <jur.gz><perl_V.txt>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
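
For anyone who wants to check the normalization point in this thread, here is a minimal sketch. It is written in Python rather than Perl purely for illustration and is not part of tokenizer.perl. The helper name nfkc_report is hypothetical, the q + COMBINING DIAERESIS pair is an example chosen here (not taken from the thread), and Python's str.isalnum is only a rough stand-in for Perl's IsAlnum, not the same property.

```python
import unicodedata


def nfkc_report(text):
    """Return (NFKC-normalized text, code points before, code points after)."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized, len(text), len(normalized)


# "Jürgen" spelled as u + COMBINING DIAERESIS (U+0308), as in the reported input.
decomposed = "Ju\u0308rgen"

# A base+combining pair for which Unicode has no precomposed character
# (illustrative choice, not from the thread).
no_precomposed = "q\u0308"

for sample in (decomposed, no_precomposed):
    normalized, before, after = nfkc_report(sample)
    print(ascii(sample), "->", ascii(normalized), before, after)

# The combining mark alone is not alphanumeric under Python's rules,
# while the precomposed letter is.
print("\u0308".isalnum(), "\u00fc".isalnum())
```

Under these assumptions, NFKC collapses the first sample to a single U+00FC code point, while the second stays as two code points, so a tokenizer that splits on non-alphanumeric characters would still separate the combining mark in the second case.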