Tom Christiansen <tchr...@perl.com> added the comment:

Ezio Melotti <rep...@bugs.python.org> wrote on Sat, 03 Sep 2011 00:28:03 -0000:
> Ezio Melotti <ezio.melo...@gmail.com> added the comment:
>
> Or they are still called UTF-8 but used in combination with different
> error handlers, like surrogateescape and surrogatepass. The "plain"
> UTF-* codecs should produce data that can be used for "open
> interchange", rejecting all the invalid data, both during encoding and
> decoding.
>
> Chapter 03, D79 also says:
>
>     To ensure that the mapping for a Unicode encoding form is
>     one-to-one, all Unicode scalar values, including those
>     corresponding to noncharacter code points and unassigned code
>     points, must be mapped to unique code unit sequences. Note that
>     this requirement does not extend to high-surrogate and
>     low-surrogate code points, which are excluded by definition from
>     the set of Unicode scalar values.
>
> and this seems to imply that the only unencodable codepoints are the
> non-scalar values, i.e. surrogates and codepoints >U+10FFFF.
> Noncharacters shouldn't thus receive any special treatment (at least
> during encoding). Tom, do you agree with this? What does Perl do with
> them?

I agree that one needs to be able to encode any scalar value and store
it in memory in a designated character encoding form. This is different
from streams, though.

The 3 different Unicode "character encoding *forms*" -- UTF-8, UTF-16,
and UTF-32 -- certainly need to support all possible scalar values.
These are the forms used to store code points in memory. They do not
have BOMs, because one knows one's memory layout. These are
specifically allowed to contain the noncharacters:

    http://www.unicode.org/reports/tr17/#CharacterEncodingForm

        The third type is peculiar to the Unicode Standard: the
        noncharacter. This is a kind of internal-use user-defined
        character, not intended for public interchange.

The problem is that one must make a clean distinction between character
encoding *forms* and character encoding *schemes*.
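A quick Python 3 sketch of how this plays out with the error handlers
Ezio mentions (assuming current CPython codec behavior): lone
surrogates, being non-scalar values, are rejected by the strict UTF-8
codec but pass with surrogatepass, while noncharacters like U+FDD0 are
scalar values and encode without complaint:

```python
# Lone surrogates are not scalar values: the strict UTF-8 codec rejects them.
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError:
    print('surrogate rejected by the strict codec')

# The surrogatepass error handler lets them through for internal round-trips.
assert '\ud800'.encode('utf-8', 'surrogatepass') == b'\xed\xa0\x80'

# Noncharacters such as U+FDD0 *are* scalar values, so they encode cleanly.
assert '\ufdd0'.encode('utf-8') == b'\xef\xb7\x90'
```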
    http://www.unicode.org/reports/tr17/#CharacterEncodingScheme

        It is important not to confuse a Character Encoding Form (CEF)
        and a CES.

        1. The CEF maps code points to code units, while the CES
           transforms sequences of code units to byte sequences.

        2. The CES must take into account the byte-order serialization
           of all code units wider than a byte that are used in the
           CEF.

        3. Otherwise identical CESs may differ in other aspects, such
           as the number of user-defined characters allowed.

        Some of the Unicode encoding schemes have the same labels as
        the three Unicode encoding forms. [...]

        As encoding schemes, UTF-16 and UTF-32 refer to serialized
        bytes, for example the serialized bytes for streaming data or
        in files; they may have either byte orientation, and a single
        BOM may be present at the start of the data. When the usage of
        the abbreviated designators UTF-16 or UTF-32 might be
        misinterpreted, and where a distinction between their use as
        referring to Unicode encoding forms or to Unicode encoding
        schemes is important, the full terms should be used. For
        example, use UTF-16 encoding form or UTF-16 encoding scheme.
        They may also be abbreviated to UTF-16 CEF or UTF-16 CES,
        respectively.

        The Unicode Standard has seven character encoding schemes:
        UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and
        UTF-32LE.

        * UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE are simple
          CESs.

        * UTF-16 and UTF-32 are compound CESs, consisting of a single,
          optional byte order mark at the start of the data followed by
          a simple CES.

I believe that what this comes down to is that you can have
noncharacters in memory as a CEF, but that you cannot have them in a
CES meant for open interchange. And what you do privately is a
different, third matter.

What Perl does differs somewhat depending on whether you are just
playing around with encodings in memory versus using streams that have
particular encodings associated with them.
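The simple-versus-compound CES split is visible directly in Python's
codec names; a small illustration (my sketch, using CPython's standard
codecs) of how the simple CESs fix the byte order and omit the BOM,
while the compound UTF-16 scheme emits one:

```python
# Simple CESs: byte order is part of the name, and no BOM is emitted.
assert 'A'.encode('utf-16-le') == b'A\x00'
assert 'A'.encode('utf-16-be') == b'\x00A'

# The compound UTF-16 scheme prepends a BOM (byte order is platform-native).
data = 'A'.encode('utf-16')
assert data[:2] in (b'\xff\xfe', b'\xfe\xff')

# The decoder for the compound scheme consumes the BOM again.
assert data.decode('utf-16') == 'A'
```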
I believe that you can think of the first as being for CEF stuff and
the second as being for CES stuff. Streams are strict. Memory isn't.

Perl will never produce nor accept any of the 66 noncharacters on a
stream marked as one of the 7 character encoding schemes. However, we
aren't always consistent about whether we generate an exception or
return replacement characters. Here the first process created a (for
the nonce, nonfatal) warning, whereas the second process raised an
exception:

    % perl -wle 'binmode(STDOUT, "encoding(UTF-16)")|| die; print chr(0xFDD0)' |
      perl -wle 'binmode(STDIN,  "encoding(UTF-16)")|| die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    UTF-16:Unicode character fdd0 is illegal at -e line 1.
    Exit 255

Here the first again makes a warning, and the second returns a
replacement string:

    % perl -wle 'binmode(STDOUT, "encoding(UTF-8)")|| die; print chr(0xFDD0)' |
      perl -wle 'binmode(STDIN,  "encoding(UTF-8)")|| die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    "\x{fdd0}" does not map to utf8.
    92

If you call encode() manually, you have much clearer control over this,
because you can specify what to do with invalid characters (exceptions,
replacements, etc.).

We have a flavor of non-strict utf8, spelled "utf8" instead of "UTF-8",
that can produce and accept illegal characters, although by default it
will still generate a warning:

    % perl -wle 'binmode(STDOUT, "encoding(utf8)")|| die; print chr(0xFDD0)' |
      perl -wle 'binmode(STDIN,  "encoding(utf8)")|| die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    64976

I could talk about ways to control whether it's a warning, an
exception, a replacement string, or nothing at all, but suffice it to
say that such mechanisms do exist. I just don't know that I agree with
the defaults.
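A rough Python analogue of the above, sketched with io.TextIOWrapper
standing in for Perl's :encoding(...) layer and a lone surrogate
standing in for an invalid character (an assumption on my part, since
Python's strict UTF-8 codec passes noncharacters through): the stream
layer checks at write time, and per-call encode() gives the same kind
of handler control Perl's encode() does:

```python
import io

# Text wrapper over a byte stream, with strict checking at the boundary.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='utf-8', errors='strict')
out.write('caf\u00e9')
out.flush()
assert buf.getvalue() == b'caf\xc3\xa9'

# A lone surrogate (invalid in strict UTF-8) fails when the stream encodes it.
bad = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')
try:
    bad.write('\ud800')
    bad.flush()
except UnicodeEncodeError:
    print('rejected at the stream layer')

# Calling encode() by hand gives per-call control over invalid characters:
s = 'x\ud800y'
assert s.encode('utf-8', 'replace') == b'x?y'
assert s.encode('utf-8', 'backslashreplace') == b'x\\ud800y'
assert s.encode('utf-8', 'ignore') == b'xy'
```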
I think a big problem here is that the Python culture doesn't use
stream encodings enough. People are always making their own repeated
and tedious calls to encode and then sending the result out a byte
stream, by which time it is too late to check. This is a real problem,
because now you cannot be permissive for the CES but conservative for
the CEF. In Perl this doesn't happen much in practice, because people
seldom send the result of encode() out a byte stream; they send things
out character streams that have proper encodings affiliated with them.
Yes, you can do it the other way, but then you lose the checks. That's
not a good idea.

Anything that deals with streams should have an encoding argument, but
many things in Python don't. For example, subprocess.Popen doesn't even
seem to take an encoding argument. This makes people do things by hand
too often. In fact, subprocess.Popen won't even accept normal (Python 3
Unicode) strings, which is a real pain. I do think the culture of
calling .encode("utf8") all over the place needs to be replaced with a
more stream-based approach in Python. There is another place where this
happens too much in Python besides subprocess.Popen, but I can't
remember where it is right now.

Perl's internal name for the strict UTF stuff is, for example,
"utf-8-strict". I think you probably want to distinguish these, and
make the default strict the way we do with "UTF-8". We never allow
nonstrict UTF-16 or UTF-32, only sometimes nonstrict UTF-8 if you call
it "utf8". I quote a bit of the perlunicode manpage below, which talks
about this.

Sorry it's taken me so long to get back to you on this. I'd be happy to
answer any further questions you might have.

--tom

PERLUNICODE(1)       Perl Programmers Reference Guide      PERLUNICODE(1)

   Non-character code points
       66 code points are set aside in Unicode as "non-character code
       points". These all have the Unassigned (Cn) General Category,
       and they never will be assigned.
       These are never supposed to be in legal Unicode input streams,
       so that code can use them as sentinels that can be mixed in with
       character data, and they always will be distinguishable from
       that data. To keep them out of Perl input streams, strict UTF-8
       should be specified, such as by using the layer
       ":encoding('UTF-8')".

       The non-character code points are the 32 between U+FDD0 and
       U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
       ... U+10FFFE, U+10FFFF. Some people are under the mistaken
       impression that these are "illegal", but that is not true. An
       application or cooperating set of applications can legally use
       them at will internally; but these code points are "illegal for
       open interchange". Therefore, Perl will not accept these from
       input streams unless lax rules are being used, and will warn
       (using the warning category "nonchar", which is a sub-category
       of "utf8") if an attempt is made to output them.

   Beyond Unicode code points
       The maximum Unicode code point is U+10FFFF. But Perl accepts
       code points up to the maximum permissible unsigned number
       available on the platform. However, Perl will not accept these
       from input streams unless lax rules are being used, and will
       warn (using the warning category "non_unicode", which is a
       sub-category of "utf8") if an attempt is made to operate on or
       output them. For example, "uc(0x11_0000)" will generate this
       warning, returning the input parameter as its result, as the
       upper case of every non-Unicode code point is the code point
       itself.

perl v5.14.0                     2011-05-07

----------
_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________