M.-A. Lemburg wrote:
> Unicode has many code points that are meant only for composition
> and don't have any standalone meaning, e.g. a combining acute
> accent (U+0301), yet they are perfectly valid code points -
> regardless of UCS-2 or UCS-4. It is easily possible to break
> such a combining sequence using slicing, so the most
> often presented argument for using UCS-4 instead of UCS-2
> (+ surrogates) is rather weak if seen by daylight.

I disagree. It is not just about slicing, it is also about searching
for a character (either through the "in" operator, or through regular
expressions). If you define an SRE character class, such a character
class cannot hold a non-BMP character in UTF-16 mode, but it can in
UCS-4 mode. Consequently, implementing XML's lexical classes (such as
Name, NCName, etc.) is much easier in UCS-4 than it is in UCS-2.

In this case, combining characters do not matter much, because the XML
spec is defined in terms of Unicode coded characters, causing combining
characters to appear as separate entities for lexical purposes (unlike
half surrogates).

Regards,
Martin

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
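
To illustrate the character-class point made in the reply above, here is a minimal sketch. It is not taken from the original thread. It assumes a current Python 3, where strings are indexed by code point (roughly the "UCS-4 mode" discussed), and it simulates the UTF-16 view by splitting a string into 16-bit code units. The helper utf16_code_units and the constant names are hypothetical, and the range [#x10000-#xEFFFF] is taken from the NameStartChar production of XML 1.0 (fifth edition), which is newer than this message.

```python
import re


def utf16_code_units(text):
    """Return the UTF-16 code units of `text` as a list of integers.

    Hypothetical helper: it simulates how a UTF-16 build would store
    the string, one 16-bit unit per list element.
    """
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# Code-point view: one character class covers the supplementary range.
SUPPLEMENTARY_NAME_START = re.compile("[\U00010000-\U000EFFFF]")

# Code-unit view: the same range needs two classes in sequence,
# a high surrogate followed by a low surrogate.
SURROGATE_PAIR = re.compile("[\ud800-\udb7f][\udc00-\udfff]")

# Combining marks are ordinary code points and fit in a single class
# (this range is part of NameChar in XML 1.0, fifth edition).
COMBINING_MARK = re.compile("[\u0300-\u036f]")

non_bmp = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP
units = utf16_code_units(non_bmp)

# Rebuild the string as a UTF-16 build would see it: two separate items.
as_utf16_string = "".join(chr(u) for u in units)

print(len(non_bmp), len(as_utf16_string))
print([hex(u) for u in units])
```

With code-point strings, SUPPLEMENTARY_NAME_START matches non_bmp as a single character. Against the simulated UTF-16 string the same class matches nothing, and the two-class SURROGATE_PAIR pattern is needed instead, whereas a combining accent such as U+0301 is matched by a single class in either representation.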