Ronald Oussoren, 06.07.2010 16:51:
On 27 Jun, 2010, at 11:48, Greg Ewing wrote:

Stefan Behnel wrote:
Greg Ewing, 26.06.2010 09:58:
Would there be any sanity in having an option to compile Python
with UTF-8 as the internal string representation?
It would break Py_UNICODE, because the internal size of a unicode
character would no longer be fixed.

It's not fixed anyway with the 2-char build -- some characters are
represented using a pair of surrogates.

For practical purposes it is not even fixed in 4-char builds. In 4-char
builds every Unicode code point corresponds to one item in a Python
unicode string, but a base character with combining characters is still
a sequence of characters and should IMHO almost always be treated as a
single object. As an example, given s="be\N{COMBINING DIAERESIS}", s[:2]
or s[2:] is almost certainly semantically invalid.
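
To make the slicing example concrete, here is a minimal sketch. It assumes a current Python 3 interpreter rather than the 2010-era builds under discussion, and the variable names are only illustrative.

```python
import unicodedata

# 'b', 'e', U+0308 COMBINING DIAERESIS: three code points, two visible characters
s = "be\N{COMBINING DIAERESIS}"
length = len(s)

# Slicing at index 2 separates the combining mark from its base character.
head, tail = s[:2], s[2:]

# A non-zero combining class means tail starts with a combining mark
# that has no base character left to attach to.
tail_is_combining = unicodedata.combining(tail[0]) != 0

# NFC normalization composes 'e' + U+0308 into the single code point U+00EB.
composed = unicodedata.normalize("NFC", s)
```

Normalization only helps where a precomposed form exists. In general a base character plus its combining marks still spans several code points.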

Sure. However, this is not a problem for the purpose of the C-API, especially for Cython (which is the angle from which I brought this up). All Cython cares about is that it mimics CPython's semantics exactly when transforming code, and a CPython runtime will ignore surrogate pairs and combining characters during iteration and indexing, and when determining the string length. So a single-character unicode string can currently be safely aliased by Py_UNICODE with correct Python semantics. That would no longer be the case if the internal representation switched to UTF-8 and/or if CPython started to take surrogates and combining characters into account when considering the string length.
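
As a rough illustration of the length semantics described above: a current CPython (3.3 or later) no longer has narrow builds, so this sketch only emulates the narrow-build view by counting UTF-16 code units. The helper name utf16_units is made up for this example.

```python
def utf16_units(text):
    """Count the UTF-16 code units needed for text (illustrative helper)."""
    return len(text.encode("utf-16-le")) // 2

clef = "\U0001D11E"      # MUSICAL SYMBOL G CLEF, outside the BMP
combined = "e\u0308"     # 'e' followed by COMBINING DIAERESIS

# len() counts storage items and never merges combining sequences.
clef_len = len(clef)                # one code point on CPython 3.3+
combined_len = len(combined)        # base character and mark counted separately

# In UTF-16 the clef needs a surrogate pair, which is the length
# a narrow build would have reported for the same string.
clef_units = utf16_units(clef)
combined_units = utf16_units(combined)
```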

Note that it's impossible to determine if a unicode string contains surrogate pairs because it's running on a narrow unicode build or because the user entered them into the string. But the user would likely expect the second case to treat them as separate code points, whereas the first is an implementation detail that should normally be invisible. Combining characters are a lot clearer here, as they can only be entered by users, so keeping them separate as provided is IMHO the expected behaviour.
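
The ambiguity can be sketched as follows, assuming Python 3.4 or later (where the surrogatepass error handler works with the UTF-16 codecs). Once both strings are reduced to UTF-16 code units, nothing records whether the pair was typed in as two surrogates or produced from one code point.

```python
# Two surrogate code points written explicitly by the user.
typed_pair = "\ud834\udd1e"
# A single non-BMP code point (MUSICAL SYMBOL G CLEF).
single = "\U0001D11E"

# At the code point level the two strings are different.
differ_as_str = typed_pair != single

# Reduced to UTF-16 code units, they become byte-for-byte identical.
pair_units = typed_pair.encode("utf-16-le", "surrogatepass")
single_units = single.encode("utf-16-le")
same_units = pair_units == single_units

# Decoding the units cannot recover which of the two inputs was used.
round_trip = pair_units.decode("utf-16-le")
```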

I think the main theme here is that the interpretation of code points and their transformation for user interfaces and backends is left to the user code. Py_UNICODE represents a code point in the current system, including surrogate pair 'escapes'. And that would change if the underlying encoding switched to something other than UTF-16/UCS-4.

Stefan
