On Wed, Jan 22, 2014 at 08:01:31AM +0100, Johan Råde wrote:
> At the Leysin Sprint Armin outlined a new design of the PyPy 2 unicode
> class. He gave two versions of the design:
>
> A: unicode with a UTF-8 implementation and a UTF-32 interface.
>
> B: unicode with a UTF-8 implementation, a UTF-16 interface on Windows
> and a UTF-32 interface on UNIX-like systems.

With a UTF-8 implementation, won't that mean that string indexing
operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
byte(s) to look at to get the character at index 42 without having to
walk the string from the start?

Have you considered the Flexible String Representation from CPython 3.3?

http://www.python.org/dev/peps/pep-0393/

Basically, if the largest code point in the string is U+00FF or below,
it is implemented using one byte per character (essentially Latin-1); if
the largest code point is U+FFFF or below, it is implemented using two
bytes per character (essentially UCS-2); otherwise, it is implemented
using four bytes per character (UCS-4 or UTF-32).

There's more to the FSR; read the PEP for further detail.


-- 
Steven
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev
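
For illustration only, the two points raised in the message above can be
sketched in a few lines of Python. This is not PyPy's or CPython's actual
code. The names utf8_char_at, _seq_len and fsr_bytes_per_char are
hypothetical, and the sketch assumes well-formed UTF-8 input. The first
function shows why naive indexing into UTF-8 has to walk from the start;
the second applies the width rule described above for the FSR.

```python
def _seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def utf8_char_at(data: bytes, index: int) -> str:
    """Return the code point at `index` in UTF-8 `data`.

    Hypothetical helper: with no auxiliary index, every earlier
    character must be skipped one by one, so the cost is O(index).
    """
    if index < 0:
        raise IndexError("negative indices are not handled in this sketch")
    pos = 0
    for _ in range(index):
        if pos >= len(data):
            raise IndexError("string index out of range")
        pos += _seq_len(data[pos])
    if pos >= len(data):
        raise IndexError("string index out of range")
    return data[pos:pos + _seq_len(data[pos])].decode("utf-8")


def fsr_bytes_per_char(s: str) -> int:
    """Bytes per character under the width rule described above.

    Hypothetical helper: only the 1/2/4-byte choice is modelled,
    none of the other details in PEP 393.
    """
    largest = max(map(ord, s), default=0)
    if largest <= 0xFF:
        return 1  # essentially Latin-1
    if largest <= 0xFFFF:
        return 2  # essentially UCS-2
    return 4      # UCS-4 / UTF-32


if __name__ == "__main__":
    encoded = "Råde €😀x".encode("utf-8")
    print(utf8_char_at(encoded, 5))
    print(fsr_bytes_per_char("Råde"), fsr_bytes_per_char("€"),
          fsr_bytes_per_char("😀"))
```

With a fixed width per character, the character at index i sits at byte
offset i * width, which is what makes indexing O(1) in that scheme.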