Terry J. Reedy <[EMAIL PROTECTED]> added the comment:

"Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16 vs. UTF-32)"

I recently read most of the Unicode 5 standard and, as near as I could tell, it no longer uses the term UCS, if it ever did. Chapter 3 has only the following 3 hits.

1. "D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.
• For historical reasons, the Unicode encoding forms are also referred to as Unicode (or UCS) transformation formats (UTF). That term is actually ambiguous between its usage for encoding forms and encoding schemes."

2. "For a discussion of the relationship between UTF-32 and UCS-4 encoding form defined in ISO/IEC 10646, see Section C.2, Encoding Forms in ISO/IEC 10646."
Section C.2 says "UCS-4 can now be taken effectively as an alias for the Unicode encoding form UTF-32" and mentions the restriction of UCS-2 to the BMP.

3. "ISO/IEC 10646 specifies an equivalent UTF-16 encoding form. For details, see Section C.3, UCS Transformation Formats."

U5 has 3 coding formats, which it names UTF-8, -16, and -32, and 7 serialization formats: the same three names, plus the latter two with 'BE' or 'LE' appended. So, to me, use of 'UCS' is either confusing or misleading.

----------------------
"If it really was UCS-2, the repr wouldn't be u'\U00010123' on windows. It'd be a pair of ill-formed code units instead."

On WinXP, IDLE 3.0b2:

>>> repr('\U00010123') # u prefix no longer needed or valid
"'𐄣'"
>>> repr('\ud800\udd23')
"'𐄣'"

Interesting: what I cut from IDLE has 2 empty boxes instead of the one larger square with 010 and 123 that I see in Firefox. len(repr('\U00010123')) is 4, not 3, so Firefox recognizes the surrogate pair and displays one symbol.

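The two literals above produce the same repr because U+10123 is stored as the surrogate pair D800 DD23 on a build that uses 16-bit code units. A minimal sketch of that arithmetic follows; the helper names `to_surrogate_pair` and `utf16_code_units` are made up for illustration and are not part of Python or of the tracker discussion. The code assumes a current Python 3, where len() counts code points, so the UTF-16 code units are counted explicitly.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the low surrogate
    return high, low


def utf16_code_units(text):
    """Count UTF-16 code units (hypothetical helper; 2 bytes per unit)."""
    return len(text.encode("utf-16-le")) // 2


high, low = to_surrogate_pair(0x10123)
print(hex(high), hex(low))              # 0xd800 0xdd23
print(utf16_code_units("\U00010123"))   # 2
```

With the two surrounding quote characters, two code units give the length of 4 reported above for the repr.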
Entering either expression directly into the interpreter gives

Python 3.0b2 (r30b2:65106, Jul 18 2008, 18:44:17) [MSC v.1500 32 bit (Intel)] on win32
>>> c='\U00010123'
>>> len(c)
2
>>> repr(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python30\lib\io.py", line 1428, in write
    b = encoder.encode(s)
  File "C:\Program Files\Python30\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-3: character maps to <undefined>

2.5 gives instead "u'\\U00010123'" as reported, so I added 3.0 to the list of versions with a problem.

I do wonder how repr() can work in IDLE but not in the underlying interpreter. Could IDLE change self.errors so that <undefined> is left as is instead of raising an exception, with the display then replacing those with empty boxes?

----------
nosy: +tjreedy
versions: +Python 3.0

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3297>
_______________________________________
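On the closing question about self.errors: the traceback shows the exception being raised while the result is encoded to cp437 for output. The sketch below only illustrates how the codec error handler changes that outcome. It assumes Python 3.3 or later, `try_encode` is a hypothetical helper, and it says nothing about what IDLE actually does internally.

```python
# The quoted character, i.e. the text the interpreter tries to write out.
text = "'\U00010123'"


def try_encode(s, errors):
    """Encode s to cp437 with the given error handler (hypothetical helper)."""
    try:
        return s.encode("cp437", errors=errors)
    except UnicodeEncodeError as exc:
        return "UnicodeEncodeError: " + exc.reason


print(try_encode(text, "strict"))            # the failure from the traceback
print(try_encode(text, "replace"))           # b"'?'"
print(try_encode(text, "backslashreplace"))  # b"'\\U00010123'"
```

Only the "strict" handler raises; the other two substitute something the code page can represent, which is roughly the behaviour the question asks about.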