Tom Christiansen <tchr...@perl.com> added the comment:

Ezio Melotti <rep...@bugs.python.org> wrote on Mon, 15 Aug 2011 04:56:55 -0000:
> Another thing I noticed is that (at least on wide builds) surrogate
> pairs are not joined "on the fly":
>
>     >>> p
>     '\ud800\udc00'
>     >>> len(p)
>     2
>     >>> p.encode('utf-16').decode('utf-16')
>     '𐀀'
>     >>> len(_)
>     1

(For those who may not immediately realize from reading the surrogates,
'𐀀' is code point 0x10000, the first non-BMP code point.  I piped it
through `uniquote -x` just to make sure.)

Yes, that makes perfect sense.  It's something of a buggy feature or
featureful bug that UTF-16 does this.

When you are thinking of arbitrary sequences of code points, which is
something you have to be able to do in memory but not in a UTF stream,
then one can say that one has four code points of anything in the
0 .. 0x10FFFF range.  Those can be any arbitrary code points only
(1) *while* in memory, *and* (2) assuming a non-UTF-16 representation,
i.e. UTF-32 or UTF-8.  You cannot do that with UTF-16, which is why it
works only on a Python wide build.  Otherwise they join up.

The reason they "join up" in UTF-16 is also the reason why, unlike in
regular memory where you might be able to use an alternate
representation like UTF-8 or UTF-32, UTF streams cannot contain
unpaired surrogates: if that stream were in UTF-16, you would never be
able to tell the difference between a sequence of a lead surrogate
followed by a tail surrogate and the same thing meaning just one
non-BMP code point.  Since you would not be able to tell the
difference, it always means only the latter, and the former sense is
illegal.  This is why lone surrogates are illegal in UTF streams.

In case it isn't obvious, *this* is the source of the [𝒜--𝒵] bug in
all the UTF-16 and UCS-2 regex languages.  It is why Java 7 added
\x{...}, so that you can rewrite that as [\x{1D49C}--\x{1D4B5}] to pass
the regex compiler: it now sees something indirect, not raw surrogates.
That's why I always check it in my cross-language regex tests.  A
16-bit language has to have a workaround, somehow, or it will be in
trouble.
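(Aside: the join-up, and the pairing arithmetic behind it, are easy to
demonstrate from Python itself.  This is just a sketch assuming a modern
Python 3, where the 'surrogatepass' error handler is needed to get lone
surrogates past the strict encoder; the join_surrogates helper is mine,
not from any library, mirroring the code-unit-to-code-point pass that a
16-bit engine has to perform.)

```python
# Two lone surrogates: fine in memory as abstract code points.
p = '\ud800\udc00'
assert len(p) == 2

# A round trip through UTF-16 fuses them: once serialized, a lead/tail
# pair can only mean the single non-BMP code point U+10000.
q = p.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
assert q == '\U00010000' and len(q) == 1

def join_surrogates(units):
    """Join UTF-16 code units into code points (the 'join up' step)."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # lead + tail surrogate -> one supplementary code point
            out.append(0x10000 + ((u - 0xD800) << 10)
                               + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)   # BMP unit (or lone surrogate) passes through
            i += 1
    return out

assert join_surrogates([0xD800, 0xDC00]) == [0x10000]
assert join_surrogates([0xD835, 0xDC9C]) == [0x1D49C]   # 𝒜
```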
The Java regex compiler doesn't generate UTF-16 for itself, either.  It
generates UTF-32 for its pattern.  You can see this right at the start
of the source code.  This is from the Java Pattern class:

    /**
     * Copies regular expression to an int array and invokes the parsing
     * of the expression which will create the object tree.
     */
    private void compile() {
        // Handle canonical equivalences
        if (has(CANON_EQ) && !has(LITERAL)) {
            normalize();
        } else {
            normalizedPattern = pattern;
        }
        patternLength = normalizedPattern.length();

        // Copy pattern to int array for convenience
        // Use double zero to terminate pattern
        temp = new int[patternLength + 2];

        hasSupplementary = false;
        int c, count = 0;
        // Convert all chars into code points
        for (int x = 0; x < patternLength; x += Character.charCount(c)) {
            c = normalizedPattern.codePointAt(x);
            if (isSupplementary(c)) {
                hasSupplementary = true;
            }
            temp[count++] = c;
        }

        patternLength = count;   // patternLength now in code points

See how that works?  They use an int(-32) array, not a char(-16) array!
It's reasonably clever, and necessary.  Because it does that, it can
now compile \x{1D49C} or erstwhile embedded UTF-8 non-BMP literals into
UTF-32, and not get upset by the stormy sea of troubles that surrogates
are.  You can't have surrogates in ranges if you don't do something
like this in a 16-bit language.

Java couldn't fix the [𝒜--𝒵] bug except by doing the \x{...}
indirection trick, because they are stuck with UTF-16.  However, they
actually can match the string "𝒜" against the pattern "^.$", and have
it fail on "^..$".  Yes, I know: the code-unit length of that string is
2, but its regex count is just one dot's worth.  I *believe* they did
it that way because tr18 says it has to work that way, but they may
also have done it just because it makes sense.  My current contact at
Oracle doing regex support is not the guy who originally wrote the
class, so I am not sure.  (He's very good, BTW.
For Java 7, he also added named captures, script properties, *and*
brought the class up to conformance with tr18's "level 1"
requirements.)

I'm thinking Python might be able to do in the regex engine on narrow
builds the sort of thing that Java does.  However, I am also thinking
that might be a lot of work for a situation more readily addressed by
phasing out narrow builds, or at least by telling people they should
use wide builds if they want that sort of thing to work.

--tom

======================================================================

            ============================================
            ===>  QUASI OFF TOPIC ADDENDUM FOLLOWS  <===
            ============================================

The rest of this message is just a comparison demo of how Perl treats
the same sort of surrogate stuff that you were showing with Python.
I'm not sure that it is as relevant to your problem as the Java part
just was, since we don't have UTF-16 the way Java does.  One place it
*may* be relevant to Python, which is the only reason I'm including
it, is that it shows what sort of elastic boundaries you have with the
old loose form utf8 that you no longer have with the current strict
definition of UTF-8 from the Unicode Standard.

It turns out that you can't play those surrogate halvsies games in
Perl, because it thinks you're nuts. :)

(Whether this is a bug or a feature may depend on where you're
standing, but we're trying to conform with our reading of the
Standard.  Things like the Unicode Collation Algorithm have
third-party conformance suites, but I don't know of any for
encoders/decoders, so one is back to deciding whether one has
correctly understood the Standard.  As we all know, this isn't always
all that easy.)

This is the clincher at the far end, so you can quit reading now if
you want:

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}"), encode("UTF-16LE", "\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

(Meanwhile, back to the quasi-off-topic monologue.)

Oh, you can have isolated surrogates in memory just fine.  We use
abstract code points, not code units, and since we aren't UTF-16,
there's no problem there:

    % perl -le 'print length("\x{D800}")'
    1
    % perl -le 'print length("\x{D800}\x{DC00}")'
    2
    % perl -le 'print length("\x{DC00}\x{D800}")'
    2

But you aren't allowed to *encode* those abstract characters into a
UTF encoding form.  (The -CS switch is like having PERLUNICODE=S or
PYTHONIOENCODING=utf8.)

    % perl -CS -le 'print "\x{D800}\x{DC00}"'
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    ??????

Now, you might think my encoder did that, but it didn't.  Those
"??????" come from my dumb Mac Terminal, *not* from the Perl utf8
encoder.  This is yet another reason, BTW, why using "?" as the
replacement for a non-encodable code point is strongly discouraged by
Unicode: otherwise it is too hard to tell whether a "?" was in the
original data or is a replacement char.  Some of our encoders do use

    � FFFD REPLACEMENT CHARACTER
      * used to replace an incoming character whose value is unknown
        or unrepresentable in Unicode
      * compare the use of 001A as a control character to indicate the
        substitute function

But that isn't triggering here.
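(For comparison, Python draws the same loose-versus-strict line, but
with codec error handlers rather than with two differently named
codecs.  A sketch, assuming a modern Python 3:)

```python
s = '\ud800\udc00'

# Strict UTF-8 refuses lone surrogates outright.
try:
    s.encode('utf-8')
except UnicodeEncodeError:
    pass
else:
    raise AssertionError('strict UTF-8 unexpectedly accepted surrogates')

# The loose form ('surrogatepass') emits them: byte-for-byte the same
# six bytes the loose utf8 encoding produces.
assert s.encode('utf-8', 'surrogatepass') == b'\xed\xa0\x80\xed\xb0\x80'

# The 'replace' handler substitutes '?' on the encode side -- exactly
# the ambiguity Unicode discourages, since a '?' in the output may or
# may not have been in the input.
assert s.encode('utf-8', 'replace') == b'??'

# Whole code points, of course, encode fine in any UTF.
cp = '\U00010000'
assert cp.encode('utf-16-be') == b'\xd8\x00\xdc\x00'
assert cp.encode('utf-8') == b'\xf0\x90\x80\x80'
```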
Running it through uniquote shows that you got illegal surrogates in
your (loose) utf8 stream:

    % perl -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -x
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    \x{D800}\x{DC00}

Use -b on uniquote to get bytes, not hex chars:

    % perl -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -b
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    \xED\xA0\x80\xED\xB0\x80

Yes, I'm not fond of the dribbled-out warning, either.  It should of
course be an exception.

Part of what is going on here is that Perl's regular "utf8" encoding
is (I believe) the same loose one that Python also uses, the old one
from before the restrictions.  If you use the strict "UTF-8", then you
get...

    % perl -le 'binmode(STDOUT, "encoding(UTF-8)"); print "\x{D800}\x{DC00}"'
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    "\x{d800}" does not map to utf8.
    "\x{dc00}" does not map to utf8.
    \x{D800}\x{DC00}

Which is pretty whacky.  I'm *not* uniquoting that this time around;
the encoder is.  It did the \x{} substitution because it is absolutely
forbidden from ever emitting surrogates into a valid strict UTF-8
stream, the way it will begrudgingly do into a loose utf8 stream
(although the result is then not valid as strict UTF-8).

If you run with all warnings promoted into exceptions (as I almost
always do), then you get more reasonable behavior, maybe.  I'm going
to use a command-line switch that's just like saying

    use warnings "FATAL" => "all";

in the top-level scope.  First with loose utf8:

    % perl -Mwarnings=FATAL,all -CS -le 'print "\x{D800}\x{DC00}"'
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Exit 255

And next with strict UTF-8:

    % perl -Mwarnings=FATAL,all -le 'binmode(STDOUT, "encoding(UTF-8)"); print "\x{D800}\x{DC00}"'
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Exit 25

So when you make them fatal, you get an exception, which by not being
caught causes the program to exit.  (The "Exit" is printed by my
shell, and no, I'm sorry but I don't know why the status value
differs.  Well, kinda I know.  It's because errno was 25 (ENOTTY) in
the second case, but had been 0 in the first case, so that one turned
into an unsigned char -1, i.e. 255.)

If you *promise* Perl you know what you're doing and full speed ahead,
you can disable utf8 warnings, which will work to get you your two
surrogates in the outbound *loose* utf8 stream, as six bytes that look
like UTF-8 but actually "can't" be (called that) due to the Standard's
rules:

    % perl -M-warnings=utf8 -CS -le 'print "\x{D800}\x{DC00}"'
    ??????
    % perl -M-warnings=utf8 -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -x
    \x{D800}\x{DC00}
    % perl -M-warnings=utf8 -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -b
    \xED\xA0\x80\xED\xB0\x80

(The weird module import is like saying

    no warnings "utf8";

in the outer scope.)  You can't get it with strict UTF-8 no matter how
hard you try, though.  Only the loose one will put out invalid UTF-8.

So what about UTF-16?  Can we build things up piecemeal?  Turns out
that no, we can't.  I *think* this is a feature, but I'm not 100.000%
sure.

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}"), encode("UTF-16LE", "\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

You get all nulls because it refused to give you anything for the
surrogates.  Although you can force it for UTF-8 if you use the loose
utf8 version that is out of spec, there is just nothing you can do to
encode surrogate code points into UTF-16.  Just can't.  There is no
loose UTF-16.  Or that's what it looks like to me.

You can have surrogates in memory, but those cannot make it into a
UTF-16 stream.  Or a UTF-32 stream, either:

    % perl -MEncode -le 'print encode("UTF-32LE", "\x{D800}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

You have to encode integral code points, not halvsies:

    % perl -C0 -MEncode -le 'print encode("UTF-16BE", "\x{10000}")' | uniquote -b
    \xD8\x00\xDC\x00
    % perl -C0 -MEncode -le 'print encode("UTF-16LE", "\x{10000}")' | uniquote -b
    \x00\xD8\x00\xDC
    % perl -C0 -MEncode -le 'print encode("UTF-16", "\x{10000}")' | uniquote -b
    \xFE\xFF\xD8\x00\xDC\x00
    % perl -C0 -MEncode -le 'print encode("UTF-32BE", "\x{10000}")' | uniquote -b
    \x00\x01\x00\x00
    % perl -C0 -MEncode -le 'print encode("UTF-32LE", "\x{10000}")' | uniquote -b
    \x00\x00\x01\x00
    % perl -C0 -MEncode -le 'print encode("UTF-32", "\x{10000}")' | uniquote -b
    \x00\x00\xFE\xFF\x00\x01\x00\x00
    % perl -C0 -MEncode -le 'print encode("UTF-8", "\x{10000}")' | uniquote -b
    \xF0\x90\x80\x80
    % perl -C0 -MEncode -le 'print encode("UTF-8", "\x{10000}")' | uniquote -x
    \x{10000}
    % perl -C0 -MEncode -le 'print encode("UTF-8", "\x{10000}")'
    𐀀
    % perl -CS -le 'print "\x{10000}"'
    𐀀

tom++

----------
_______________________________________
Python tracker
<rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________