Re: [Python-Dev] thoughts on the bytes/string discussion

Stefan Behnel Tue, 06 Jul 2010 23:58:34 -0700

Ronald Oussoren, 06.07.2010 16:51:

On 27 Jun, 2010, at 11:48, Greg Ewing wrote:

Stefan Behnel wrote:

Greg Ewing, 26.06.2010 09:58:

Would there be any sanity in having an option to compile Python
with UTF-8 as the internal string representation?

It would break Py_UNICODE, because the internal size of a unicode
character would no longer be fixed.


It's not fixed anyway with the 2-char build -- some characters are
represented using a pair of surrogates.


It is for practical purposes not even fixed in 4-char builds. In 4-char
builds every Unicode code point corresponds to one item in a Python
unicode string, but a base character with combining characters is still
a sequence of characters and should IMHO almost always be treated as a
single object. As an example, given s="be\N{COMBINING DIAERESIS}", s[:2]
or s[2:] is almost certainly semantically invalid.

Sure. However, this is not a problem for the purpose of the C-API,
especially for Cython (which is the angle from which I brought this up).
All Cython cares about is that it mimics CPython's semantics exactly when
transforming code, and a CPython runtime will ignore surrogate pairs and
combining characters during iteration and indexing, and when determining
the string length. So a single character unicode string can currently be
safely aliased by Py_UNICODE with correct Python semantics. That would no
longer be the case if the internal representation switched to UTF-8 and/or
if CPython started to take surrogates and combining characters into account
when considering the string length.
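
For illustration, the length and indexing behaviour described above can be
sketched in Python. This is only a sketch, assuming a current Python 3
interpreter rather than the 2010-era builds under discussion. The helper
group_with_marks() is a hypothetical name, not part of any API, and it is a
simplification rather than full grapheme cluster segmentation.

```python
import unicodedata

# The example string from the quoted message: "b", "e", COMBINING DIAERESIS.
s = "be\N{COMBINING DIAERESIS}"


def group_with_marks(text):
    """Hypothetical helper: attach each combining mark to the preceding
    base character. Simplified; not a full grapheme cluster algorithm."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


length = len(s)        # counts code points, the combining mark included
head = s[:2]           # slices between "e" and its diaeresis
tail = s[2:]           # a lone combining mark with no base character
clusters = group_with_marks(s)
composed = unicodedata.normalize("NFC", s)  # "e" + diaeresis becomes one code point
```

The runtime counts and slices by code point, so interpreting "e" plus the
diaeresis as one unit is left to user code such as the helper above.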

Note that it's impossible to determine if a unicode string contains
surrogate pairs because it's running on a narrow unicode build or because
the user entered them into the string. But the user would likely expect the
second case to treat them as separate code points, whereas the first is an
implementation detail that should normally be invisible. Combining
characters are a lot clearer here, as they can only be entered by users, so
keeping them separate as provided is IMHO the expected behaviour.
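
The ambiguity can be sketched as follows. This assumes a current Python 3
interpreter, which no longer has narrow builds, so the 16-bit storage of a
narrow build is only simulated here through the UTF-16 codec. The helper
utf16_units() is a hypothetical name used for illustration.

```python
def utf16_units(text):
    """Hypothetical helper: return the 16-bit code units that a UTF-16
    based representation would hold for the given string."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


# One code point outside the BMP (MUSICAL SYMBOL G CLEF, U+1D11E).
astral = "\U0001D11E"

# Two surrogate code points typed in explicitly by the user.
explicit = "\ud834\udd1e"

astral_units = utf16_units(astral)
explicit_units = utf16_units(explicit)
```

Both strings produce the same pair of code units, 0xD834 and 0xDD1E, so a
runtime that only sees the 16-bit units cannot tell which of the two the
user meant.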

I think the main theme here is that the interpretation of code points and
their transformation for user interfaces and backends is left to the user
code. Py_UNICODE represents a code point in the current system, including
surrogate pair 'escapes'. And that would change if the underlying encoding
switched to something other than UTF-16/UCS-4.
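
To make the last point concrete, here is a small sketch of why a fixed-size
unit could not alias a character under UTF-8. It assumes a current Python 3
interpreter and uses the codecs only to measure encoded sizes; the sample
characters and variable names are chosen for illustration.

```python
# Sample characters from different ranges of the code space.
samples = ["a", "\u00e9", "\u20ac", "\U0001D11E"]

# Bytes needed per character in UTF-8: varies with the code point.
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]

# Bytes needed per character in UCS-4 (UTF-32): always the same.
ucs4_widths = [len(ch.encode("utf-32-le")) for ch in samples]
```

With UTF-8 the widths come out as 1, 2, 3 and 4 bytes, while UCS-4 uses 4
bytes for every sample, which is what lets a single fixed-size value stand
in for one item of the string.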


Stefan

