Am 22.11.2010 11:47, schrieb Stephen J. Turnbull: > "Martin v. Löwis" writes: > > > More interestingly (and to the subject) is chr: how did you arrive > > at C9 banning Python3's definition of chr? This chr function puts > > the code sequence into well-formed UTF-16; that's the whole point of > > UTF-16. > > No, it doesn't, in the specific case of surrogate code points. In > 3.1.2 from MacPorts on a iBook G4 and from Gentoo on AMD64, > chr(0xd800) returns "\ud800".
Ah, I see - this is *not* the subject's issue, right? > > I don't know if that's by design (eg, so that it can be used in the > implementation of the surrogateescape error handler) or a correctable > oversight, but it's not conformant. I disagree: Quoting from Unicode 5.0, section 5.4: # The individual components of implementations may have different # levels of support for surrogates, as long as those components are # assembled and communicate correctly. Low-level string processing, # where a Unicode string is not interpreted but is handled simply as an # array of code units, may ignore surrogate pairs. With such strings, # for example, a truncation operation with an arbitrary offset might # break a surrogate pair. (For further discussion, see Section 2.7, # Unicode Strings.) For performance in string operations, such behavior # is reasonable at a low level, but it requires higher-level processes # to ensure that offsets are on character boundaries so as to guarantee # the integrity of surrogate pairs. So lower-level routines (which I claim chr() is one) are allowed to create lone surrogates. The formal requirement behind this is C1: # A process shall not interpret a high-surrogate code point or a # low-surrogate code point as an abstract character. I also claim that Python, in both narrow and wide mode, conforms to this requirement. Notice that the requirement is a ban on interpreting the code point as a character. In particular, unicodedata.category claims that the code point is of class Cs (surrogate), which I consider conforming. By the same line of reasoning, it is also OK that chr() allows the creation of unassigned code points, even though C2 says that they must not be interpreted as abstract characters. The rationale for supporting these characters in chr() goes back much further than the surrogateescape handler - as Python unicode strings are sequences of code points, it would be impractical if you couldn't create some of them, or even would have to consult the UCD before determining whether they can be created. Regards, Martin _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com