[pypy-dev] PyPy 2 unicode class

Johan Råde Tue, 21 Jan 2014 23:02:20 -0800

At the Leysin Sprint Armin outlined a new design of the PyPy 2 unicode class. He gave two versions of the design:


 A: unicode with a UTF-8 implementation and a UTF-32 interface.

 B: unicode with a UTF-8 implementation, a UTF-16 interface on Windows and a UTF-32 interface on UNIX-like systems.


(Armin claims that this design can be implemented efficiently.)

Armin asked me (a Windows developer) which design I prefer.

My answer is A. Why?

Short answer: Because surrogate pairs are a pain in the ...

------- Long answer --------

** A well-designed interface should accurately model the problem domain. **

In this case the problem domain contains the following abstract objects:
 1. Unicode code points (i.e. integers in [0, 0x110000))
 2. Unicode characters
 3. Unicode strings

The relations between these abstract objects are:

 * There is a 1-1 correspondence between Unicode code points and Unicode characters.

 * Unicode strings are abstract sequences of Unicode characters.

With a UTF-32 interface there is a 1-1 correspondence between these abstract objects and the following concrete Python 2 objects:

 1'. int objects n with 0 <= n < 0x110000
 2'. unicode objects s with len(s) == 1
 3'. unicode objects
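The correspondence between 1' and 2' can be checked with a minimal sketch. It is written for Python 3, whose str type already offers this interface; on Python 2 one would use unichr and unicode objects instead. The helper name roundtrips is hypothetical and only serves the illustration.

```python
def roundtrips(code_point):
    """Return True if int -> character -> int is lossless for code_point."""
    ch = chr(code_point)                            # 1' -> 2'
    return len(ch) == 1 and ord(ch) == code_point   # 2' -> 1'

# A BMP code point, the last BMP one, the first non-BMP one, the last valid one.
samples = [0x41, 0xFFFF, 0x10000, 0x10FFFF]
results = [roundtrips(n) for n in samples]
```

With a UTF-16 interface the check would fail for every sample at or above 0x10000, because the resulting object would have length 2.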

With a UTF-16 interface there is not a 1-1 correspondence between 2 and 2'. Then we are not modelling the problem domain any more, and we have to deal with nonsense such as:

 >>> s = u'\N{PHAISTOS DISC SIGN FLUTE}'
 >>> len(s)
 2
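To make the difference between the two interfaces concrete, here is a small sketch, assuming Python 3 (where len already counts code points). The helper utf16_length is a hypothetical name; it reports the length a UTF-16 interface would show.

```python
def utf16_length(s):
    """Number of UTF-16 code units needed to represent s."""
    # Each UTF-16 code unit is 2 bytes; 'utf-16-le' adds no byte order mark.
    return len(s.encode("utf-16-le")) // 2

s = "\N{PHAISTOS DISC SIGN FLUTE}"
code_points = len(s)           # what design A would report
code_units = utf16_length(s)   # what design B would report on Windows
```

The two numbers agree for strings that stay within the Basic Multilingual Plane and differ as soon as a code point >= 0x10000 appears.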

Next, would such a change break any existing Python 2 code on Windows?

Yes, it would. For instance, the following code for counting characters in a string assumes that ord(c) < 0x10000:


 f = [0] * (1 << 16)
 for c in s:
     f[ord(c)] += 1
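The loop above indexes a list with 1 << 16 slots, so it raises IndexError as soon as ord(c) can reach 0x10000. A sketch of a version that does not depend on the range of ord(c), using collections.Counter from the standard library (the function name count_characters is hypothetical, and the example is written for Python 3):

```python
from collections import Counter

def count_characters(s):
    """Count occurrences of each character without assuming ord(c) < 0x10000."""
    return Counter(s)

# One non-BMP character between two ASCII letters.
flute = "\N{PHAISTOS DISC SIGN FLUTE}"
counts = count_characters("a" + flute + "a")
```

Keying the counts by character rather than by ord(c) gives the same result whichever interface is in use, as long as each character is a single item of the string.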

On the other hand, some broken code will be fixed by this change.

There is a lot of Windows code that does not handle surrogate pairs correctly. For instance, in an earlier version of Notepad, you had to press Delete twice to delete a character with Unicode code point >= 0x10000.

Such broken code might be fixed by this change.
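The kind of surrogate-aware logic that such code tends to omit can be sketched as follows. This is only an illustration over a plain list of UTF-16 code units; the function name delete_char is hypothetical and does not describe how any real editor is implemented.

```python
def delete_char(units, i):
    """Delete the character starting at index i in a list of UTF-16 code units."""
    is_high = 0xD800 <= units[i] <= 0xDBFF
    has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
    # A high surrogate followed by a low surrogate forms one character.
    width = 2 if is_high and has_low else 1
    return units[:i] + units[i + width:]

# "a", then U+10000 encoded as the surrogate pair D800 DC00, then "b".
units = [0x61, 0xD800, 0xDC00, 0x62]
after = delete_char(units, 1)
```

With an interface that exposes whole code points, none of this bookkeeping is needed: deleting one item deletes one character.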

I believe that fixing broken code will be more common than breaking working code, so the overall effect on existing code will be beneficial.


----------------------

Finally, I very much dislike the term UTF-32 interface.

You can, and should, explain that interface without mentioning encodings at all; just describe the problem domain and how the interface models the problem domain.

Encodings should be an implementation detail, and should not leak through the interface at all. Python should make you think in terms of abstractions, not in terms of bytes. Python is not C.


Let's call it the natural interface.

(I hope this makes more sense than my ramblings on IRC last night.)