Tom Christiansen <tchr...@perl.com> added the comment:

Ezio Melotti <rep...@bugs.python.org> wrote on Sat, 03 Sep 2011 00:28:03 -0000:
> Ezio Melotti <ezio.melo...@gmail.com> added the comment:
>
> Or they are still called UTF-8 but used in combination with different
> error handlers, like surrogateescape and surrogatepass. The "plain"
> UTF-* codecs should produce data that can be used for "open
> interchange", rejecting all the invalid data, both during encoding and
> decoding.
>
> Chapter 03, D79 also says:
>
>     To ensure that the mapping for a Unicode encoding form is
>     one-to-one, all Unicode scalar values, including those
>     corresponding to noncharacter code points and unassigned code
>     points, must be mapped to unique code unit sequences. Note that
>     this requirement does not extend to high-surrogate and
>     low-surrogate code points, which are excluded by definition from
>     the set of Unicode scalar values.
>
> and this seems to imply that the only unencodable codepoints are the
> non-scalar values, i.e. surrogates and codepoints >U+10FFFF.
> Noncharacters shouldn't thus receive any special treatment (at least
> during encoding). Tom, do you agree with this? What does Perl do with
> them?

I agree that one needs to be able to encode any scalar value and store
it in memory in a designated character encoding form. This is different
from streams, though.

The 3 different Unicode "character encoding *forms*" -- UTF-8, UTF-16,
and UTF-32 -- certainly need to support all possible scalar values.
These are the forms used to store code points in memory. They do not
have BOMs, because one knows one's memory layout. These are
specifically allowed to contain the noncharacters:

    http://www.unicode.org/reports/tr17/#CharacterEncodingForm

        The third type is peculiar to the Unicode Standard: the
        noncharacter. This is a kind of internal-use user-defined
        character, not intended for public interchange.

The problem is that one must make a clean distinction between character
encoding *forms* and character encoding *schemes*.
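A quick Python 3 sketch of how this plays out with the error handlers
Ezio mentions (assuming current CPython codec behavior): lone
surrogates, being non-scalar values, are rejected by the strict UTF-8
codec but pass with surrogatepass, while noncharacters like U+FDD0 are
scalar values and encode without complaint:

```python
# Lone surrogates are not scalar values: the strict UTF-8 codec rejects them.
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError:
    print('surrogate rejected by the strict codec')

# The surrogatepass error handler lets them through for internal round-trips.
assert '\ud800'.encode('utf-8', 'surrogatepass') == b'\xed\xa0\x80'

# Noncharacters such as U+FDD0 *are* scalar values, so they encode cleanly.
assert '\ufdd0'.encode('utf-8') == b'\xef\xb7\x90'
```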
    http://www.unicode.org/reports/tr17/#CharacterEncodingScheme

        It is important not to confuse a Character Encoding Form (CEF)
        and a CES.

        1. The CEF maps code points to code units, while the CES
           transforms sequences of code units to byte sequences.

        2. The CES must take into account the byte-order serialization
           of all code units wider than a byte that are used in the
           CEF.

        3. Otherwise identical CESs may differ in other aspects, such
           as the number of user-defined characters allowed.

        Some of the Unicode encoding schemes have the same labels as
        the three Unicode encoding forms. [...]

        As encoding schemes, UTF-16 and UTF-32 refer to serialized
        bytes, for example the serialized bytes for streaming data or
        in files; they may have either byte orientation, and a single
        BOM may be present at the start of the data. When the usage of
        the abbreviated designators UTF-16 or UTF-32 might be
        misinterpreted, and where a distinction between their use as
        referring to Unicode encoding forms or to Unicode encoding
        schemes is important, the full terms should be used. For
        example, use UTF-16 encoding form or UTF-16 encoding scheme.
        They may also be abbreviated to UTF-16 CEF or UTF-16 CES,
        respectively.

        The Unicode Standard has seven character encoding schemes:
        UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and
        UTF-32LE.

        * UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE are simple
          CESs.

        * UTF-16 and UTF-32 are compound CESs, consisting of a single,
          optional byte order mark at the start of the data followed by
          a simple CES.

I believe that what this comes down to is that you can have
noncharacters in memory as a CEF, but that you cannot have them in a
CES meant for open interchange. And what you do privately is a
different, third matter.

What Perl does differs somewhat depending on whether you are just
playing around with encodings in memory versus using streams that have
particular encodings associated with them.
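The simple-versus-compound CES split is visible directly in Python's
codec names; a small illustration (my sketch, using CPython's standard
codecs) of how the simple CESs fix the byte order and omit the BOM,
while the compound UTF-16 scheme emits one:

```python
# Simple CESs: byte order is part of the name, and no BOM is emitted.
assert 'A'.encode('utf-16-le') == b'A\x00'
assert 'A'.encode('utf-16-be') == b'\x00A'

# The compound UTF-16 scheme prepends a BOM (byte order is platform-native).
data = 'A'.encode('utf-16')
assert data[:2] in (b'\xff\xfe', b'\xfe\xff')

# The decoder for the compound scheme consumes the BOM again.
assert data.decode('utf-16') == 'A'
```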
I believe that you can think of the first as being for CEF stuff and
the second as being for CES stuff. Streams are strict. Memory isn't.

Perl will never produce nor accept any of the 66 noncharacters on a
stream marked as one of the 7 character encoding schemes. However, we
aren't always consistent about whether we generate an exception or
return replacement characters. Here the first process created a (for
the nonce, nonfatal) warning, whereas the second process raised an
exception:

    % perl -wle 'binmode(STDOUT, "encoding(UTF-16)")|| die; print chr(0xFDD0)' |
      perl -wle 'binmode(STDIN,  "encoding(UTF-16)")|| die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    UTF-16:Unicode character fdd0 is illegal at -e line 1.
    Exit 255

Here the first again makes a warning, and the second returns a
replacement string:

    % perl -wle 'binmode(STDOUT, "encoding(UTF-8)")|| die; print chr(0xFDD0)' |
      perl -wle 'binmode(STDIN,  "encoding(UTF-8)")|| die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    "\x{fdd0}" does not map to utf8.
    92

If you call encode() manually, you have much clearer control over this,
because you can specify what to do with invalid characters (exceptions,
replacements, etc.).

We have a flavor of non-strict utf8, spelled "utf8" instead of "UTF-8",
that can produce and accept illegal characters, although by default it
will still generate a warning:

    % perl -wle 'binmode(STDOUT, "encoding(utf8)")|| die; print chr(0xFDD0)' |
      perl -wle 'binmode(STDIN,  "encoding(utf8)")|| die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    64976

I could talk about ways to control whether it's a warning, an
exception, a replacement string, or nothing at all, but suffice it to
say that such mechanisms do exist. I just don't know that I agree with
the defaults.
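A rough Python analogue of the above, sketched with io.TextIOWrapper
standing in for Perl's :encoding(...) layer and a lone surrogate
standing in for an invalid character (an assumption on my part, since
Python's strict UTF-8 codec passes noncharacters through): the stream
layer checks at write time, and per-call encode() gives the same kind
of handler control Perl's encode() does:

```python
import io

# Text wrapper over a byte stream, with strict checking at the boundary.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='utf-8', errors='strict')
out.write('caf\u00e9')
out.flush()
assert buf.getvalue() == b'caf\xc3\xa9'

# A lone surrogate (invalid in strict UTF-8) fails when the stream encodes it.
bad = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')
try:
    bad.write('\ud800')
    bad.flush()
except UnicodeEncodeError:
    print('rejected at the stream layer')

# Calling encode() by hand gives per-call control over invalid characters:
s = 'x\ud800y'
assert s.encode('utf-8', 'replace') == b'x?y'
assert s.encode('utf-8', 'backslashreplace') == b'x\\ud800y'
assert s.encode('utf-8', 'ignore') == b'xy'
```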
I think a big problem here is that the Python culture doesn't use
stream encodings enough. People are always making their own repeated
and tedious calls to encode and then sending the result out a byte
stream, by which time it is too late to check. This is a real problem,
because now you cannot be permissive for the CES but conservative for
the CEF. In Perl this doesn't happen much in practice, because people
seldom send the result of encode() out a byte stream; they send things
out character streams that have proper encodings affiliated with them.
Yes, you can do it the other way, but then you lose the checks. That's
not a good idea.

Anything that deals with streams should have an encoding argument, but
many things in Python don't. For example, subprocess.Popen doesn't even
seem to take an encoding argument. This makes people do things by hand
too often. In fact, subprocess.Popen won't even accept normal (Python 3
Unicode) strings, which is a real pain. I do think the culture of
calling .encode("utf8") all over the place needs to be replaced with a
more stream-based approach in Python. There is another place where this
happens too much in Python besides subprocess.Popen, but I can't
remember where it is right now.

Perl's internal name for the strict UTF stuff is, for example,
"utf-8-strict". I think you probably want to distinguish these, and
make the default strict the way we do with "UTF-8". We never allow
nonstrict UTF-16 or UTF-32, only sometimes nonstrict UTF-8 if you call
it "utf8". I quote a bit of the perlunicode manpage below, which talks
about this.

Sorry it's taken me so long to get back to you on this. I'd be happy to
answer any further questions you might have.

--tom

PERLUNICODE(1)       Perl Programmers Reference Guide      PERLUNICODE(1)

   Non-character code points
       66 code points are set aside in Unicode as "non-character code
       points". These all have the Unassigned (Cn) General Category,
       and they never will be assigned.
       These are never supposed to be in legal Unicode input streams,
       so that code can use them as sentinels that can be mixed in with
       character data, and they always will be distinguishable from
       that data. To keep them out of Perl input streams, strict UTF-8
       should be specified, such as by using the layer
       ":encoding('UTF-8')".

       The non-character code points are the 32 between U+FDD0 and
       U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
       ... U+10FFFE, U+10FFFF. Some people are under the mistaken
       impression that these are "illegal", but that is not true. An
       application or cooperating set of applications can legally use
       them at will internally; but these code points are "illegal for
       open interchange". Therefore, Perl will not accept these from
       input streams unless lax rules are being used, and will warn
       (using the warning category "nonchar", which is a sub-category
       of "utf8") if an attempt is made to output them.

   Beyond Unicode code points
       The maximum Unicode code point is U+10FFFF. But Perl accepts
       code points up to the maximum permissible unsigned number
       available on the platform. However, Perl will not accept these
       from input streams unless lax rules are being used, and will
       warn (using the warning category "non_unicode", which is a
       sub-category of "utf8") if an attempt is made to operate on or
       output them. For example, "uc(0x11_0000)" will generate this
       warning, returning the input parameter as its result, as the
       upper case of every non-Unicode code point is the code point
       itself.

perl v5.14.0                     2011-05-07

----------
_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________