Re: [Moses-support] Moses tokenizer treats combining diaeresis inconsistently
So to summarize:

The main issue is that the Moses tokenizer operates at the character rather than grapheme level on some versions of perl, treating combining characters (which are arguably parts of words in many cases) as non-alphanumeric and splitting them off.

Older versions of perl appear to be operating at the grapheme level or internally normalizing for purposes of evaluating IsAlnum, making the tokenizer inconsistent across machines.

Some graphemes, such as those in Vietnamese, do not have a single precomposed code point, so NFKC is insufficient to mask this issue.

Tom doesn't want NFKC for Japanese (which the Moses tokenizer doesn't support at the moment). I still think it makes sense for the Latin alphabet. Also, there are lighter forms of canonicalization.

For once, my favorite Unicode FAQ is relevant: http://www.unicode.org/faq/char_combmark.html#17

Kenneth

On 12/29/2014 11:29 PM, Tom Hoar wrote:
> Japanese is another language that suffers from standard Unicode NFKC
> because the normalization applies changes that can not be reversed.
>
> On 12/30/2014 04:40 AM, John D Burger wrote:
>>> This is also a reason to turn Unicode normalization on. If the
>>> tokenizer did NFKC at the beginning, then the problem would go away.
>> If I understand the situation correctly, this would only fix this particular
>> example and a few others like it. There are many base+combining grapheme
>> clusters in Unicode text which cannot be normalized to a single pre-composed
>> character. Vietnamese comes to mind.
>>
>> - JB
>>
>> On Dec 29, 2014, at 16:05 , Kenneth Heafield wrote:
>>
>>> Dear Moses,
>>>
>>> The attached file, taken from line 2345157 of
>>> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
>>> , tokenizes differently on different machines.
>>>
>>> I'm running tokenizer.perl from head (481a07dc) with this perl:
>>>
>>> This is perl 5, version 18, subversion 2 (v5.18.2) built for
>>> x86_64-linux-thread-multi
>>> (with 25 registered patches, see perl -V for more detail)
>>>
>>> perl -V is attached from newer machines.
>>>
>>> The input is "Jürgen" with a specific encoding:
>>>
>>> uconv -f utf-8 -x any-name jur
>>>
>>> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
>>> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
>>> LETTER E}\N{LATIN SMALL LETTER N}\N{}
>>>
>>> So the umlaut is encoded as a normal "u" character followed by a
>>> combining diaeresis marker. This encoding is legal, but it differs from
>>> the single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
>>> DIAERESIS}.
>>>
>>> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
>>> DIAERESIS} as a single character and recognizing it as part of the
>>> IsAlnum class. Tokenizing on these machines outputs
>>>
>>> Jürgen
>>>
>>> Newer machines are treating them separately, recognizing \N{COMBINING
>>> DIAERESIS} as a separate character that is not part of IsAlnum. The
>>> Moses tokenizer then treats it as something to split off, yielding this
>>> tokenization:
>>>
>>> Ju ̈ rgen
>>>
>>> I thought it might be locale-related but IsAlnum is supposed to be
>>> locale-agnostic. I couldn't come up with environment variables that
>>> made the new machines tokenize as a single word.
>>>
>>> Maybe this is a perl bug, but the result is that two different machines
>>> running the same perl script produce different tokenization :-(.
>>>
>>> This is also a reason to turn Unicode normalization on. If the
>>> tokenizer did NFKC at the beginning, then the problem would go away.
>>>
>>> Kenneth

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
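For readers who want to reproduce the effect without the attached file, here is a minimal sketch in Python (not the Perl code in tokenizer.perl). The function naive_split is a hypothetical stand-in for "split off anything that is not alphanumeric", and the use of NFC is an assumption chosen as one of the lighter canonicalizations; the thread itself discusses NFKC.

```python
import unicodedata

# "Jürgen" with the umlaut stored as "u" followed by U+0308 COMBINING DIAERESIS,
# matching the uconv output quoted in the thread.
decomposed = "Ju\u0308rgen"


def naive_split(text):
    """Hypothetical stand-in for the tokenizer's behaviour on newer machines:
    put spaces around every code point that is not alphanumeric."""
    pieces = [ch if ch.isalnum() else f" {ch} " for ch in text]
    return " ".join("".join(pieces).split())


# NFC composes u + U+0308 into the single code point U+00FC.
composed = unicodedata.normalize("NFC", decomposed)

print([unicodedata.name(ch) for ch in decomposed])
print(repr(naive_split(decomposed)))  # the combining mark is split off
print(repr(naive_split(composed)))    # the word stays in one piece
```

The combining mark has general category Mn, so a purely code-point-level alphanumeric test rejects it. Composing first hides the problem for this input only because a precomposed "ü" exists.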
Re: [Moses-support] Moses tokenizer treats combining diaeresis inconsistently
Japanese is another language that suffers from standard Unicode NFKC because the normalization applies changes that can not be reversed.

On 12/30/2014 04:40 AM, John D Burger wrote:
>> This is also a reason to turn Unicode normalization on. If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
> If I understand the situation correctly, this would only fix this particular
> example and a few others like it. There are many base+combining grapheme
> clusters in Unicode text which cannot be normalized to a single pre-composed
> character. Vietnamese comes to mind.
>
> - JB
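To make the irreversibility point concrete, the following Python sketch compares NFC and NFKC on a few inputs. The sample characters are my own choice for illustration and were not mentioned in the thread; the helper normalize_both is hypothetical.

```python
import unicodedata

# Illustrative inputs (assumptions, not taken from the thread).
samples = {
    "half-width katakana KA": "\uff76",
    "full-width Latin A": "\uff21",
    "u + combining diaeresis": "u\u0308",
}


def normalize_both(text):
    """Return the (NFC, NFKC) forms of text."""
    return (unicodedata.normalize("NFC", text),
            unicodedata.normalize("NFKC", text))


def codepoints(text):
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


for label, text in samples.items():
    nfc, nfkc = normalize_both(text)
    print(f"{label}: NFC={codepoints(nfc)} NFKC={codepoints(nfkc)}")
```

NFC leaves the half-width and full-width characters alone and only composes the base letter with its mark. NFKC additionally folds compatibility variants into their ordinary counterparts, and after that step there is no way to tell which form the input used.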
Re: [Moses-support] Moses tokenizer treats combining diaeresis inconsistently
> This is also a reason to turn Unicode normalization on. If the
> tokenizer did NFKC at the beginning, then the problem would go away.

If I understand the situation correctly, this would only fix this particular example and a few others like it. There are many base+combining grapheme clusters in Unicode text which cannot be normalized to a single pre-composed character. Vietnamese comes to mind.

- JB
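As a sketch of what handling such clusters could look like, the Python below treats combining marks as word characters instead of relying on normalization. This is an assumption about one possible workaround, not what tokenizer.perl does; in regex terms it is roughly what allowing \p{M} alongside alphanumerics would achieve. The names is_word_char and mark_aware_split are hypothetical, and "q" plus a combining diaeresis is used as an example of a cluster with no precomposed form.

```python
import unicodedata


def is_word_char(ch):
    """True for alphanumerics and for combining marks (categories Mn, Mc, Me)."""
    return ch.isalnum() or unicodedata.category(ch).startswith("M")


def mark_aware_split(text):
    """Put spaces around code points that are neither alphanumeric nor marks."""
    pieces = [ch if is_word_char(ch) else f" {ch} " for ch in text]
    return " ".join("".join(pieces).split())


# A base letter plus mark for which Unicode defines no single composed character.
no_precomposed = "q\u0308"
still_two = unicodedata.normalize("NFC", no_precomposed)

print(len(still_two))  # normalization cannot merge this cluster
print(repr(mark_aware_split("Ju\u0308rgen, q\u0308")))
```

With this rule the result no longer depends on whether a precomposed character happens to exist, at the cost of also keeping stray marks attached to whatever precedes them.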