Re: [Moses-support] Moses tokenizer treats combining diaeresis inconsistently

Kenneth Heafield Mon, 29 Dec 2014 21:40:20 -0800

So to summarize:

The main issue is that, on some versions of perl, the Moses tokenizer
operates at the code point level rather than the grapheme level.  It
treats combining characters (which are arguably parts of words in many
cases) as non-alphanumeric and splits them off.
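
A minimal sketch of that behaviour in Python, not the tokenizer's actual
Perl code.  The function names are hypothetical, and Python's
str.isalnum() is only a stand-in for perl's IsAlnum.  It shows how a
per-code-point alphanumeric test splits off a combining mark, and how
counting marks (general categories Mn, Mc, Me) as word characters keeps
the word together.

```python
import unicodedata


def is_word_char(ch, marks_are_word_chars):
    """Hypothetical stand-in for the tokenizer's alphanumeric test."""
    if ch.isalnum():
        return True
    # General categories starting with "M" (Mn, Mc, Me) are combining marks.
    return marks_are_word_chars and unicodedata.category(ch).startswith("M")


def split_off_non_word(text, marks_are_word_chars=False):
    """Put spaces around every code point that is not a word character."""
    pieces = []
    for ch in text:
        if ch.isspace() or is_word_char(ch, marks_are_word_chars):
            pieces.append(ch)
        else:
            pieces.append(" " + ch + " ")
    # Collapse runs of whitespace introduced by the padding above.
    return " ".join("".join(pieces).split())


decomposed = "Ju\u0308rgen"  # "u" followed by COMBINING DIAERESIS

# Per-code-point test: the mark is split off from the word.
print(split_off_non_word(decomposed).split(" "))

# Mark-aware test: the word stays in one piece.
print(split_off_non_word(decomposed, marks_are_word_chars=True).split(" "))
```

The first call yields three tokens and the second yields one, which
mirrors the difference reported between the newer and older machines.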


Older versions of perl appear to be operating at the grapheme level or
internally normalizing for purposes of evaluating IsAlnum, making the
tokenizer inconsistent across machines.

Some graphemes, such as those in Vietnamese, do not have a single
precomposed code point, so NFKC is insufficient to mask this issue.
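
A small Python illustration of why normalization alone does not cover
every case.  The "q" plus diaeresis sequence is an example chosen here
because Unicode has no precomposed character for it; it is not taken
from the thread or from any particular language.

```python
import unicodedata

# "u" + COMBINING DIAERESIS has a precomposed form, U+00FC.
u_seq = "u\u0308"
# "q" + COMBINING DIAERESIS has none (illustrative choice, see above).
q_seq = "q\u0308"

u_norm = unicodedata.normalize("NFKC", u_seq)
q_norm = unicodedata.normalize("NFKC", q_seq)

# The first sequence collapses to one code point; the second keeps its
# separate combining mark even after NFKC.
print(len(u_norm), len(q_norm))
```

Any per-code-point alphanumeric test would still see the combining mark
in the second sequence after normalization.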

Tom doesn't want NFKC for Japanese (which the Moses tokenizer doesn't
support at the moment).  I still think it makes sense for the Latin
alphabet.  Also, there are lighter forms of canonicalization.
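
As one example of a lighter form, and only as an assumption about what
might suit the tokenizer, NFC composes canonical sequences without
applying compatibility mappings.  The Python sketch below contrasts the
two on the decomposed "Jürgen" and on a halfwidth katakana letter, the
kind of character NFKC changes irreversibly.

```python
import unicodedata

decomposed = "Ju\u0308rgen"   # "u" + COMBINING DIAERESIS
halfwidth_ka = "\uff76"       # HALFWIDTH KATAKANA LETTER KA
fullwidth_ka = "\u30ab"       # KATAKANA LETTER KA

# Both forms compose the diaeresis into U+00FC.
nfc_name = unicodedata.normalize("NFC", decomposed)
nfkc_name = unicodedata.normalize("NFKC", decomposed)

# NFC leaves the halfwidth letter alone; NFKC maps it to the regular one.
nfc_ka = unicodedata.normalize("NFC", halfwidth_ka)
nfkc_ka = unicodedata.normalize("NFKC", halfwidth_ka)

# After NFKC the two katakana inputs are indistinguishable, so the
# original width cannot be recovered from the output.
lost_distinction = nfkc_ka == unicodedata.normalize("NFKC", fullwidth_ka)

print(nfc_name == nfkc_name, nfc_ka == halfwidth_ka, lost_distinction)
```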

For once, my favorite Unicode FAQ is relevant:
http://www.unicode.org/faq/char_combmark.html#17

Kenneth

On 12/29/2014 11:29 PM, Tom Hoar wrote:
> Japanese is another language that suffers from standard Unicode NFKC
> because the normalization applies changes that cannot be reversed.
> 
> 
> 
> On 12/30/2014 04:40 AM, John D Burger wrote:
>>> This is also a reason to turn Unicode normalization on.  If the
>>> tokenizer did NFKC at the beginning, then the problem would go away.
>> If I understand the situation correctly, this would only fix this particular 
>> example and a few others like it. There are many base+combining grapheme 
>> clusters in Unicode text which cannot be normalized to a single pre-composed 
>> character. Vietnamese comes to mind.
>>
>> - JB
>>
>> On Dec 29, 2014, at 16:05 , Kenneth Heafield <mo...@kheafield.com> wrote:
>>
>>> Dear Moses,
>>>
>>>     The attached file, taken from line 2345157 of
>>> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
>>> , tokenizes differently on different machines.
>>>
>>>     I'm running tokenizer.perl from head (481a07dc) with this perl:
>>>
>>> This is perl 5, version 18, subversion 2 (v5.18.2) built for
>>> x86_64-linux-thread-multi
>>> (with 25 registered patches, see perl -V for more detail)
>>>
>>> perl -V is attached from newer machines.
>>>
>>>     The input is "Jürgen" with a specific encoding:
>>>
>>> uconv -f utf-8 -x any-name jur
>>>
>>> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
>>> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
>>> LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
>>>
>>> So the umlaut is encoded as a normal "u" character followed by a
>>> combining diaeresis marker.  This encoding is legal, but it differs from
>>> the single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
>>> DIAERESIS}.
>>>
>>> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
>>> DIAERESIS} as a single character and recognizing it as part of the
>>> IsAlnum class.  Tokenizing on these machines outputs
>>>
>>> Jürgen
>>>
>>> Newer machines are treating them separately, recognizing \N{COMBINING
>>> DIAERESIS} as a separate character that is not part of IsAlnum.  The
>>> Moses tokenizer then treats it as something to split off, yielding this
>>> tokenization:
>>>
>>> Ju ̈ rgen
>>>
>>> I thought it might be locale-related but IsAlnum is supposed to be
>>> locale-agnostic.  I couldn't come up with environment variables that
>>> made the new machines tokenize as a single word.
>>>
>>> Maybe this is a perl bug, but the result is that two different machines
>>> running the same perl script produce different tokenization :-(.
>>>
>>> This is also a reason to turn Unicode normalization on.  If the
>>> tokenizer did NFKC at the beginning, then the problem would go away.
>>>
>>> Kenneth
>>>
>>> <jur.gz><perl_V.txt>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
