Ronald Oussoren, 06.07.2010 16:51:
On 27 Jun, 2010, at 11:48, Greg Ewing wrote:
Stefan Behnel wrote:
Greg Ewing, 26.06.2010 09:58:
Would there be any sanity in having an option to compile Python
with UTF-8 as the internal string representation?
It would break Py_UNICODE, because the internal size of a unicode
character would no longer be fixed.
It's not fixed anyway with the 2-char build -- some characters are
represented using a pair of surrogates.
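A rough sketch of the surrogate point, assuming Python 3.3 or later. Narrow builds no longer exist there, so the UTF-16 encoding stands in for the 2-byte internal representation. The helper name utf16_units is made up for illustration.

```python
# Count UTF-16 code units, i.e. the items a narrow (2-byte) build would store.
def utf16_units(text):
    return len(text.encode("utf-16-le")) // 2

bmp = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_units(bmp))     # 1
print(utf16_units(astral))  # 2: a high/low surrogate pair
```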
It is for practical purposes not even fixed in 4-char builds. In 4-char
builds every Unicode code point corresponds to one item in a Python
unicode string, but a base character with combining characters is still
a sequence of characters and should IMHO almost always be treated as a
single object. As an example, given s="be\N{COMBINING DIAERESIS}", s[:2]
or s[2:] is almost certainly semantically invalid.
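The example above can be tried directly. This is a minimal sketch assuming Python 3, where string literals are unicode. The NFC step is an addition for illustration and only helps when a precomposed character exists.

```python
import unicodedata

s = "be\N{COMBINING DIAERESIS}"  # displays as two characters, stored as 3 code points

print(len(s))                               # 3
print(s[:2] == "be")                        # True: the slice strips the diaeresis
print(s[2:] == "\N{COMBINING DIAERESIS}")   # True: a bare combining mark

# NFC composes "e" + U+0308 into the single code point U+00EB.
nfc = unicodedata.normalize("NFC", s)
print(len(nfc))                             # 2
```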
Sure. However, this is not a problem for the purpose of the C-API,
especially for Cython (which is the angle from which I brought this up).
All Cython cares about is that it mimics CPython's semantics exactly when
transforming code, and a CPython runtime will ignore surrogate pairs and
combining characters during iteration and indexing, and when determining
the string length. So a single character unicode string can currently be
safely aliased by Py_UNICODE with correct Python semantics. That would no
longer be the case if the internal representation switched to UTF-8 and/or
if CPython started to take surrogates and combining characters into account
when considering the string length.
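To illustrate why a fixed-size unit could no longer alias a character under UTF-8, here is a small sketch assuming Python 3. The sample characters are arbitrary picks, one for each UTF-8 width.

```python
samples = ["a", "\u00e9", "\u20ac", "\U0001D11E"]

# Bytes needed per code point in UTF-8: the width varies from 1 to 4.
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_widths)   # [1, 2, 3, 4]

# In UTF-32 (the 4-char build's layout) every code point takes 4 bytes.
utf32_widths = [len(ch.encode("utf-32-le")) for ch in samples]
print(utf32_widths)  # [4, 4, 4, 4]
```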
Note that it's impossible to determine if a unicode string contains
surrogate pairs because it's running on a narrow unicode build or because
the user entered them into the string. But the user would likely expect the
second case to treat them as separate code points, whereas the first is an
implementation detail that should normally be invisible. Combining
characters are a lot clearer here, as they can only be entered by users, so
keeping them separate as provided is IMHO the expected behaviour.
I think the main theme here is that the interpretation of code points and
their transformation for user interfaces and backends is left to the user
code. Py_UNICODE represents a code point in the current system, including
surrogate pair 'escapes'. And that would change if the underlying encoding
switched to something other than UTF-16/UCS-4.
Stefan