On 8/26/2011 5:29 AM, "Martin v. Löwis" wrote:
>> IronPython and Jython can retain UTF-16 as their native form if that
>> makes interop cleaner, but in doing so they need to ensure that basic
>> operations like indexing and len work in terms of code points, not
>> code units, if they are to conform.

My impression is that a UTF-16 implementation, to be properly called such, must do len() and [] in terms of code points, which is why Python's narrow builds are called UCS-2 rather than UTF-16.
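
To be concrete, on a 3.2 narrow build (a wide build reports len(s) == 1 for the same string):

>>> s = '\U00010400'   # one code point, outside the BMP
>>> len(s)             # counted in UTF-16 code units, not code points
2
>>> s[0]               # indexing exposes the high surrogate
'\ud801'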

> That means that they won't conform, period. There is no efficient
> maintainable implementation strategy to achieve that property,

Given that both 'efficient' and 'maintainable' are relative terms, that is your pessimistic opinion, not really a fact.

> it may well take years until somebody provides an efficient
> unmaintainable implementation.

>> Does this make sense, or have I completely misunderstood things?

> You seem to assume it is ok for Jython/IronPython to provide indexing in
> O(n). It is not.

Why do you keep saying that O(n) is the alternative? I have already given a simple solution that is O(log k), where k is the number of non-BMP characters (codepoints stored as surrogate pairs) if there are any, and O(1) otherwise (all-BMP strings). It uses O(k) extra space. I think that is pretty efficient; I suspect it is the most time-efficient possible without using at least as much space as a UCS-4 solution. The fact that you and others do not want this for CPython should not preclude other implementations that are more tied to UTF-16 from exploring the idea.
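
To make the idea concrete, here is a rough sketch in Python (the class name and details are mine, purely for illustration; a real version would of course live in C inside the str object): keep the UTF-16 code units plus a sorted side table of the code point indices of the non-BMP characters, and let bisect do the O(log k) adjustment.

from bisect import bisect_left

class CodePointString:
    # Sketch only: UTF-16 code units plus an O(k) side table, giving
    # O(log k) indexing by code point and O(1) when the table is empty.
    def __init__(self, units):
        self.units = list(units)        # UTF-16 code units as ints
        self.pairs = []                 # code point indices of non-BMP chars
        i = cp = 0
        while i < len(self.units):
            if 0xD800 <= self.units[i] <= 0xDBFF:   # high surrogate
                self.pairs.append(cp)
                i += 2                  # skip the whole pair
            else:
                i += 1
            cp += 1
        self.ncodepoints = cp

    def __len__(self):                  # length in code points
        return self.ncodepoints

    def __getitem__(self, cp):          # index by code point
        if not 0 <= cp < self.ncodepoints:
            raise IndexError(cp)
        # Each non-BMP character before cp adds one extra code unit.
        i = cp + bisect_left(self.pairs, cp)
        u = self.units[i]
        if 0xD800 <= u <= 0xDBFF:       # recombine a surrogate pair
            return chr(0x10000 + ((u - 0xD800) << 10)
                       + (self.units[i + 1] - 0xDC00))
        return chr(u)

Building the side table is one O(n) pass at string creation; slicing and iteration can reuse the same table.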

Maintainability partly depends on whether all-codepoint support is built in or bolted onto a BMP-only implementation burdened with back-compatibility for a code unit API. Maintainability is probably harder with a separate UTF-32 type, which CPython has but which I gather Jython and IronPython do not. It might or might not be easier if there were a separate internal character type containing a 32-bit code point value, so that iteration and indexing (and single-char slicing) always returned the same type of object regardless of whether the character was in the BMP or not. This certainly would help all the unicode database functions.
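
Roughly what I have in mind, again only as a made-up sketch rather than anything that exists in any implementation:

class Char:
    # A separate character type holding a full 32-bit code point, so that
    # iteration and indexing always yield the same kind of object whether
    # or not the character is in the BMP.
    __slots__ = ('cp',)
    def __init__(self, cp):
        self.cp = cp                    # 0 .. 0x10FFFF
    def __int__(self):
        return self.cp
    def __str__(self):
        return chr(self.cp)

def iter_chars(s):
    # Illustrative helper: yield Char objects, recombining surrogate pairs
    # on a narrow build so callers never see half a character.
    it = iter(s)
    for c in it:
        o = ord(c)
        if 0xD800 <= o <= 0xDBFF:
            o = 0x10000 + ((o - 0xD800) << 10) + (ord(next(it)) - 0xDC00)
        yield Char(o)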

Tom Christiansen appears to have said that Perl does, or will, use UTF-8 plus auxiliary arrays. If so, we will find out whether they can maintain it.

---
Terry Jan Reedy
