On Fri, Aug 26, 2011 at 3:57 PM, Terry Reedy <tjre...@udel.edu> wrote:
>
>
> On 8/26/2011 5:29 AM, "Martin v. Löwis" wrote:
>>>
>>> IronPython and Jython can retain UTF-16 as their native form if that
>>> makes interop cleaner, but in doing so they need to ensure that basic
>>> operations like indexing and len work in terms of code points, not
>>> code units, if they are to conform.
>
> My impression is that a UTF-16 implementation, to be properly called such,
> must do len and [] in terms of code points, which is why Python's narrow
> builds are called UCS-2 and not UTF-16.

I don't think anyone else has that impression. Please cite chapter and
verse if you really think this is important. IIUC, UCS-2 does not
allow surrogate pairs, whereas Python (and Java, and .NET, and
Windows) 16-bit strings all do support surrogate pairs. And they all
have a len or length function that counts code units, not code points.
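
To make the code unit versus code point distinction concrete, here is a
minimal sketch. It assumes a current CPython (3.3 or later), where len()
on a str counts code points; the helper names are made up for illustration.

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids a leading BOM.
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    """Number of code points in s (what len() reports on CPython 3.3+)."""
    return len(s)


# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
text = "a\U0001D11Eb"

units = utf16_code_units(text)   # the G clef needs a surrogate pair
points = code_points(text)
print(units, points)
```

A 16-bit string type whose length function counts code units would report
the first number for this string; a code point API would report the second.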

>> That means that they won't conform, period. There is no efficient
>> maintainable implementation strategy to achieve that property,
>
> Given that both 'efficient' and 'maintainable' are relative terms, that is
> your pessimistic opinion, not really a fact.
>
>> it may well take years until somebody provides an efficient
>> unmaintainable implementation.
>>
>>> Does this make sense, or have I completely misunderstood things?
>>
>> You seem to assume it is ok for Jython/IronPython to provide indexing in
>> O(n). It is not.
>
> Why do you keep saying that O(n) is the alternative? I have already given a
> simple solution that is O(log k), where k is the number of non-BMP
> characters/codepoints/surrogate_pairs if there are any, and O(1) otherwise
> (for all BMP chars). It uses O(k) space. I think that is pretty efficient. I
> suspect that is the most time efficient possible without using at least as
> much space as a UCS-4 solution. The fact that you and others do not want this
> for CPython should not preclude other implementations that are more tied to
> UTF-16 from exploring the idea.
>
> Maintainability partly depends on whether all-codepoint support is built in
> or bolted on to a BMP-only implementation burdened with back compatibility
> for a code unit API. Maintainability is probably harder with a separate
> UTF-32 type, which CPython has but which I gather Jython and IronPython do
> not. It might or might not be easier if there were a separate internal
> character type containing a 32 bit code point value, so that iteration and
> indexing (and single char slicing) always returned the same type of object
> regardless of whether the character was in the BMP or not. This certainly
> would help all the unicode database functions.
>
> Tom Christiansen appears to have said that Perl uses or will use UTF-8 plus
> auxiliary arrays. If so, we will find out if they can maintain it.

Their API style is completely different from ours. What Perl can
maintain has little bearing on what Python can.
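
For readers following the O(log k) claim quoted above, here is one way such
a scheme could look. This is only a sketch under my own assumptions, not
Terry's actual proposal and not anything in CPython, Jython, or IronPython:
the class name Utf16String and the attributes units and astral are
hypothetical. It keeps UTF-16 code units plus a sorted side table of the
code point indexes of non-BMP characters (O(k) extra space), and uses a
binary search on that table to translate a code point index into a code
unit offset.

```python
from bisect import bisect_left


class Utf16String:
    """Hypothetical UTF-16 storage with code point indexing."""

    def __init__(self, text: str):
        self.units = []    # 16-bit code units
        self.astral = []   # sorted code point indexes of non-BMP characters
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp > 0xFFFF:
                # Encode as a surrogate pair and remember where it sits.
                cp -= 0x10000
                self.units.append(0xD800 | (cp >> 10))
                self.units.append(0xDC00 | (cp & 0x3FF))
                self.astral.append(i)
            else:
                self.units.append(cp)

    def __len__(self) -> int:
        # Every non-BMP character occupies one extra code unit.
        return len(self.units) - len(self.astral)

    def __getitem__(self, i: int) -> str:
        if not 0 <= i < len(self):
            raise IndexError(i)
        # k = number of non-BMP characters strictly before index i;
        # this is the O(log k) step (trivial when the table is empty).
        k = bisect_left(self.astral, i)
        pos = i + k
        hi = self.units[pos]
        if k < len(self.astral) and self.astral[k] == i:
            lo = self.units[pos + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(hi)


s = Utf16String("a\U0001D11Eb\U00010000c")
print(len(s), len(s.units), [hex(ord(s[i])) for i in range(len(s))])
```

The sketch covers only len and single-item indexing with non-negative
indexes; slicing, mutation of the side table, and interop with a code unit
API are exactly the parts where the maintainability question arises.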

-- 
--Guido van Rossum (python.org/~guido)