At the Leysin Sprint, Armin outlined a new design for the PyPy 2 unicode
class. He gave two versions of the design:
  A: unicode with a UTF-8 implementation and a UTF-32 interface.
  B: unicode with a UTF-8 implementation, a UTF-16 interface on Windows
     and a UTF-32 interface on UNIX-like systems.
(Armin claims that this design can be implemented efficiently.)
Armin asked me (a Windows developer) which design I prefer.
My answer is A. Why?
Short answer: Because surrogate pairs are a pain in the ...
------- Long answer --------
** A well-designed interface should accurately model the problem domain. **
In this case the problem domain contains the following abstract objects:
1. Unicode code points (i.e. integers in [0, 0x110000))
2. Unicode characters
3. Unicode strings
The relations between these abstract objects are:
* There is a 1-1 correspondence between Unicode code points and
  Unicode characters.
* Unicode strings are abstract sequences of Unicode characters.
With a UTF-32 interface there is a 1-1 correspondence between these
abstract objects and the following concrete Python 2 objects:
1'. int objects n with 0 <= n < 0x110000
2'. unicode objects s with len(s) == 1
3'. unicode objects
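This correspondence can be checked mechanically. Below is a minimal sketch, assuming Python 3, whose str type already exposes a code point interface of this kind (in Python 2 the analogous calls would be unichr and ord). The helper name check_natural_interface is made up for illustration.

```python
def check_natural_interface(limit=0x110000):
    """Return True if chr/ord give a 1-1 mapping between the integers
    in [0, limit) and strings of length 1."""
    for n in range(limit):
        c = chr(n)          # code point -> character
        if len(c) != 1:     # every character is exactly one string item
            return False
        if ord(c) != n:     # character -> code point round-trips
            return False
    return True


print(check_natural_interface())
```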
With a UTF-16 interface there is no 1-1 correspondence between 2 and
2': a character with code point >= 0x10000 is exposed as a surrogate
pair, i.e. as a unicode object of length 2. Then we are not modelling
the problem domain any more, and we have to deal with nonsense such as:

    >>> s = u'\N{PHAISTOS DISC SIGN FLUTE}'
    >>> len(s)
    2
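To see where the 2 comes from, one can count code points and UTF-16 code units separately. The sketch below assumes Python 3, where len() counts code points; utf16_units is a hypothetical helper, not part of any library.

```python
def utf16_units(s):
    """Number of 16-bit code units needed to encode s as UTF-16."""
    # 'utf-16-le' emits no byte order mark, so every 2 bytes is one unit.
    return len(s.encode('utf-16-le')) // 2


s = '\N{PHAISTOS DISC SIGN FLUTE}'
print(len(s))            # number of code points
print(utf16_units(s))    # number of UTF-16 code units
```

The first number is what a UTF-32 interface reports as the length, the second is what a UTF-16 interface reports.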
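Next, would such a change (design A) break any existing Python 2 code
on Windows? Yes, it would. For instance, the following code for counting
how often each character occurs in a string assumes that
ord(c) < 0x10000 and would raise an IndexError for any character
outside that range:

    f = [0] * (1 << 16)
    for c in s:
        f[ord(c)] += 1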
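A version of this loop that does not depend on the size of the code point range could use a dictionary instead of a fixed-size list. This is only a sketch of one possible fix, written for Python 3; the function name char_frequencies is an assumption for illustration.

```python
from collections import Counter


def char_frequencies(s):
    """Map each code point occurring in s to its number of occurrences."""
    # Counter grows on demand, so no upper bound on ord(c) is assumed.
    return Counter(ord(c) for c in s)


freq = char_frequencies('aab\U000101D0')
print(freq[ord('a')])     # occurrences of 'a'
print(freq[0x101D0])      # occurrences of the non-BMP character
```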
On the other hand, some broken code might be fixed by this change.
There is a lot of Windows code that does not handle surrogate pairs
correctly. For instance, in an earlier version of Notepad, you had to
press Delete twice to delete a character with Unicode code point
>= 0x10000.
I believe that fixing broken code will be more common than breaking
working code, so the overall effect on existing code will be beneficial.
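The Notepad behaviour is easy to reproduce in any code that works directly on UTF-16 code units. The sketch below is hypothetical (it says nothing about how Notepad is actually implemented): it stores text as a list of 16-bit code units and deletes one whole character at a given index, removing both halves of a surrogate pair when needed.

```python
def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF


def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF


def delete_char(units, i):
    """Delete the character starting at index i from a list of UTF-16
    code units and return the new list."""
    # A high surrogate followed by a low surrogate is one character
    # spread over two code units; both must be removed together.
    if (is_high_surrogate(units[i]) and i + 1 < len(units)
            and is_low_surrogate(units[i + 1])):
        return units[:i] + units[i + 2:]
    return units[:i] + units[i + 1:]


# 'a', U+101D0 as the surrogate pair (0xD800, 0xDDD0), 'b'
units = [0x61, 0xD800, 0xDDD0, 0x62]
print(delete_char(units, 1))
```

Code that forgets the surrogate check deletes only one code unit per keypress, which is the "press Delete twice" symptom. With an interface where every item of a string is a whole character, this bookkeeping would not be needed in user code.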
----------------------
Finally, I very much dislike the term "UTF-32 interface".
You can, and should, explain that interface without mentioning
encodings at all; just describe the problem domain and how the interface
models the problem domain.
Encodings should be an implementation detail, and should not leak
through the interface at all. Python should make you think in terms of
abstractions, not in terms of bytes. Python is not C.
Let's call it the "natural interface".
(I hope this makes more sense than my ramblings on IRC last night.)