New submission from Alexander Belopolsky <[email protected]>:
As discussed in issue 10521 and the sprawling "len(chr(i)) = 2?" thread [1] on
python-dev, many functions in the Python library behave differently on narrow
and wide builds. While some differences are unavoidable, such as the length of
strings containing non-BMP characters, many functions can work around them.
For example, the ord() function already returns integers over 0xFFFF when
given a surrogate pair as a string of length two on a narrow build. Other
functions, such as str.isalpha(), are not yet aware of surrogates. See also
issue 9200.
A consensus is developing that support for non-BMP characters on narrow builds
is here to stay and that naive functions should be fixed. Unfortunately,
working with surrogates in Python's C code is tricky because the Unicode C-API
does not provide much support, and existing examples of surrogate processing
look like this:
-    while (u != uend && w != wend) {
-        if (0xD800 <= u[0] && u[0] <= 0xDBFF
-            && 0xDC00 <= u[1] && u[1] <= 0xDFFF)
-        {
-            *w = (((u[0] & 0x3FF) << 10) | (u[1] & 0x3FF)) + 0x10000;
-            u += 2;
-        }
-        else {
-            *w = *u;
-            u++;
-        }
-        w++;
-    }
The attached patch introduces a Py_UNICODE_NEXT() macro that allows replacing
the code above with two lines:
+    while (u != uend && w != wend)
+        *w++ = Py_UNICODE_NEXT(u, uend);
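For readers without the patch at hand, here is a rough sketch of what a "read next code point" helper could look like. It is not the macro from unicode-next.diff: the names unit16, ucs4, IS_HIGH_SURROGATE, IS_LOW_SURROGATE, JOIN_SURROGATES and unicode_next are hypothetical stand-ins, and it is written as a function taking a pointer to the cursor rather than as a macro. It also assumes the end pointer is there so that a lone high surrogate in the last unit is not paired with whatever follows the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a narrow-build Py_UNICODE and for Py_UCS4. */
typedef uint16_t unit16;
typedef uint32_t ucs4;

/* Hypothetical helper macros; the names in the actual patch may differ. */
#define IS_HIGH_SURROGATE(ch) (0xD800 <= (ch) && (ch) <= 0xDBFF)
#define IS_LOW_SURROGATE(ch)  (0xDC00 <= (ch) && (ch) <= 0xDFFF)
#define JOIN_SURROGATES(hi, lo) \
    ((((((ucs4)(hi)) & 0x3FF) << 10) | (((ucs4)(lo)) & 0x3FF)) + 0x10000)

/* Return the code point starting at *pp and advance *pp by 1 or 2 units.
   A surrogate that is not part of a valid pair is returned unchanged. */
static ucs4
unicode_next(const unit16 **pp, const unit16 *end)
{
    const unit16 *p = *pp;
    assert(p < end);  /* the caller must not read past the buffer */
    if (IS_HIGH_SURROGATE(p[0]) && end - p > 1 && IS_LOW_SURROGATE(p[1])) {
        *pp = p + 2;
        return JOIN_SURROGATES(p[0], p[1]);
    }
    *pp = p + 1;
    return p[0];
}
```

With such a helper the conversion loop reduces to `while (u != uend && w != wend) *w++ = unicode_next(&u, uend);`, which has the same shape as the two-line replacement above.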
The patch also introduces a set of macros for manipulating surrogates, but I
have not started replacing more instances of verbose surrogate processing
because I would first like to look for higher-level abstractions such as
Py_UNICODE_NEXT(). For example, many instances could benefit from a
Py_UNICODE_PUT_NEXT(ptr, ch) macro that would put a UCS4 character ch into the
Py_UNICODE buffer pointed to by ptr and advance ptr by 1 or 2 units as necessary.
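The write direction could be sketched along the same lines. Py_UNICODE_PUT_NEXT is only proposed here, so the following is an illustration under assumptions rather than anything from the patch: unit16, ucs4 and unicode_put_next are hypothetical names, the helper is a function instead of a macro, and the caller is assumed to have reserved room for two units.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for a narrow-build Py_UNICODE and for Py_UCS4. */
typedef uint16_t unit16;
typedef uint32_t ucs4;

/* Store ch at *pp as one unit (BMP) or as a surrogate pair (non-BMP),
   then advance *pp past the units written. */
static void
unicode_put_next(unit16 **pp, ucs4 ch)
{
    unit16 *p = *pp;
    assert(ch <= 0x10FFFF);  /* outside the Unicode range otherwise */
    if (ch > 0xFFFF) {
        ch -= 0x10000;
        *p++ = (unit16)(0xD800 | (ch >> 10));    /* high surrogate */
        *p++ = (unit16)(0xDC00 | (ch & 0x3FF));  /* low surrogate */
    }
    else {
        *p++ = (unit16)ch;
    }
    *pp = p;
}
```

Taking the cursor by address keeps the advance explicit at the call site; a macro version would update ptr in place, as the description above suggests.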
[1] http://mail.python.org/pipermail/python-dev/2010-November/105908.html
----------
assignee: belopolsky
components: Extension Modules, Interpreter Core, Unicode
files: unicode-next.diff
keywords: patch
messages: 122464
nosy: Rhamphoryncus, amaury.forgeotdarc, belopolsky, eric.smith, ezio.melotti,
lemburg, pitrou
priority: normal
severity: normal
stage: patch review
status: open
title: Py_UNICODE_NEXT and other macros for surrogates
type: feature request
versions: Python 3.2
Added file: http://bugs.python.org/file19825/unicode-next.diff
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue10542>
_______________________________________