Re: [Python-Dev] New Py_UNICODE doc

Martin v. Löwis Sun, 08 May 2005 01:53:16 -0700

M.-A. Lemburg wrote:
> Unicode has many code points that are meant only for composition
> and don't have any standalone meaning, e.g. a combining acute
> accent (U+0301), yet they are perfectly valid code points -
> regardless of UCS-2 or UCS-4. It is easily possible to break
> such a combining sequence using slicing, so the most
> often presented argument for using UCS-4 instead of UCS-2
> (+ surrogates) is rather weak if seen by daylight.


I disagree. It is not just about slicing; it is also about
searching for a character (either through the "in" operator
or through regular expressions). If you define an SRE character
class, that class cannot hold a non-BMP character in UTF-16 mode,
because each member of the class is a single code unit and a
non-BMP character takes two (a surrogate pair); in UCS-4 mode it
can. Consequently, implementing XML's lexical classes (such as
Name, NCName, etc.) is much easier in UCS-4 than in UCS-2. In this
case, combining characters do not matter much, because the XML
spec is defined in terms of Unicode coded characters, so
combining characters appear as separate entities for lexical
purposes (unlike half surrogates).
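
A minimal sketch of the character-class point, assuming a current
Python 3 interpreter (where str always holds full code points). The
helper to_utf16_units is hypothetical, not part of any library; it only
mimics how a UTF-16 (narrow) build would see the same text, one code
unit per string element:

```python
import re


def to_utf16_units(text):
    """Hypothetical helper: re-express text as one str element per
    UTF-16 code unit, imitating the view of a narrow build."""
    data = text.encode("utf-16-be")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "big"))
        for i in range(0, len(data), 2)
    )


# Two distinct non-BMP letters that share the same high surrogate.
target = "\U00010400"      # the character we want in the class
neighbour = "\U00010401"   # a different character, not in the class

# UCS-4 view: the class holds exactly one character.
wide_hits = re.findall("[" + target + "]", neighbour)

# UTF-16 view: the class really holds two half surrogates, so it
# matches the high surrogate of an unrelated character.
narrow_class = "[" + to_utf16_units(target) + "]"
narrow_hits = re.findall(narrow_class, to_utf16_units(neighbour))

print(wide_hits)                            # no match in UCS-4 mode
print([hex(ord(u)) for u in narrow_hits])   # a lone half surrogate
```

In the UTF-16 view the class cannot express "this one character"; it can
only list code units, which is why lexical classes that include non-BMP
ranges need extra handling there.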

Regards,
Martin