On 8/26/2011 5:29 AM, "Martin v. Löwis" wrote:
>> IronPython and Jython can retain UTF-16 as their native form if that
>> makes interop cleaner, but in doing so they need to ensure that basic
>> operations like indexing and len work in terms of code points, not
>> code units, if they are to conform.

My impression is that a UTF-16 implementation, to be properly called such, must do len() and [] in terms of code points, which is why Python's narrow builds are called UCS-2 rather than UTF-16.
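
To be concrete, on a 3.2 narrow build (a wide build reports len(s) == 1 for the same string):

>>> s = '\U00010400'   # one code point, outside the BMP
>>> len(s)             # counted in UTF-16 code units, not code points
2
>>> s[0]               # indexing exposes the high surrogate
'\ud801'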

> That means that they won't conform, period. There is no efficient
> maintainable implementation strategy to achieve that property,

Given that both 'efficient' and 'maintainable' are relative terms, that is your pessimistic opinion, not really a fact.

> it may well take years until somebody provides an efficient
> unmaintainable implementation.

>> Does this make sense, or have I completely misunderstood things?

> You seem to assume it is ok for Jython/IronPython to provide indexing in
> O(n). It is not.

Why do you keep saying that O(n) is the alternative? I have already given a simple solution that is O(log k), where k is the number of non-BMP characters (codepoints stored as surrogate pairs) if there are any, and O(1) otherwise (all-BMP strings). It uses O(k) extra space. I think that is pretty efficient; I suspect it is the most time-efficient possible without using at least as much space as a UCS-4 solution. The fact that you and others do not want this for CPython should not preclude other implementations that are more tied to UTF-16 from exploring the idea.
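
To make the idea concrete, here is a rough sketch in Python (the class name and details are mine, purely for illustration; a real version would of course live in C inside the str object): keep the UTF-16 code units plus a sorted side table of the code point indices of the non-BMP characters, and let bisect do the O(log k) adjustment.

from bisect import bisect_left

class CodePointString:
    # Sketch only: UTF-16 code units plus an O(k) side table, giving
    # O(log k) indexing by code point and O(1) when the table is empty.
    def __init__(self, units):
        self.units = list(units)        # UTF-16 code units as ints
        self.pairs = []                 # code point indices of non-BMP chars
        i = cp = 0
        while i < len(self.units):
            if 0xD800 <= self.units[i] <= 0xDBFF:   # high surrogate
                self.pairs.append(cp)
                i += 2                  # skip the whole pair
            else:
                i += 1
            cp += 1
        self.ncodepoints = cp

    def __len__(self):                  # length in code points
        return self.ncodepoints

    def __getitem__(self, cp):          # index by code point
        if not 0 <= cp < self.ncodepoints:
            raise IndexError(cp)
        # Each non-BMP character before cp adds one extra code unit.
        i = cp + bisect_left(self.pairs, cp)
        u = self.units[i]
        if 0xD800 <= u <= 0xDBFF:       # recombine a surrogate pair
            return chr(0x10000 + ((u - 0xD800) << 10)
                       + (self.units[i + 1] - 0xDC00))
        return chr(u)

Building the side table is one O(n) pass at string creation; slicing and iteration can reuse the same table.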

Maintainability partly depends on whether all-codepoint support is built in or bolted onto a BMP-only implementation burdened with back-compatibility for a code unit API. Maintainability is probably harder with a separate UTF-32 type, which CPython has but which I gather Jython and IronPython do not. It might or might not be easier if there were a separate internal character type containing a 32-bit code point value, so that iteration and indexing (and single-char slicing) always returned the same type of object regardless of whether the character was in the BMP or not. This certainly would help all the unicode database functions.
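
Roughly what I have in mind, again only as a made-up sketch rather than anything that exists in any implementation:

class Char:
    # A separate character type holding a full 32-bit code point, so that
    # iteration and indexing always yield the same kind of object whether
    # or not the character is in the BMP.
    __slots__ = ('cp',)
    def __init__(self, cp):
        self.cp = cp                    # 0 .. 0x10FFFF
    def __int__(self):
        return self.cp
    def __str__(self):
        return chr(self.cp)

def iter_chars(s):
    # Illustrative helper: yield Char objects, recombining surrogate pairs
    # on a narrow build so callers never see half a character.
    it = iter(s)
    for c in it:
        o = ord(c)
        if 0xD800 <= o <= 0xDBFF:
            o = 0x10000 + ((o - 0xD800) << 10) + (ord(next(it)) - 0xDC00)
        yield Char(o)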

Tom Christiansen appears to have said that Perl does, or will, use UTF-8 plus auxiliary arrays. If so, we will find out whether they can maintain it.

---
Terry Jan Reedy
