On Sep 4, 2013, at 14:28 , Brendan Eich <[email protected]> wrote:

> Anne van Kesteren wrote:
>>> Here's the spec for String.prototype.codePointAt:
>>>
>>> 8. Let first be the code unit value of the element at index position in
>>> the String S.
>>> 11. If second < 0xDC00 or second > 0xDFFF, then return first.
>>>
>>> I take it you are objecting to step 11?
>> 
>> And step 8. The indexing is based on code units so you cannot actually
>> do indexing easily. You'd need to use the iterator to iterate over a
>> string getting only code points out.
>> 
>> 
>>>> The indexing of codePointAt() is also kind of sad as it just passes
>>>> through to charCodeAt(),
>>>
>>> I don't see that in the spec cited above.
>> 
>> How do you read step 8?
> 
> 8. Let first be the code unit value of the element at index position in the 
> String S.
> 
> This does not "[pass] through to charCodeAt()" literally, which would mean a 
> call to S.charCodeAt(position). I thought that's what you meant.
> 
> So you want a code point index, not a code unit index. That would not be 
> useful for the lower-level purposes Allen identified. Again it seems you're 
> trying to abstract away from all the details that probably will matter for 
> string hackers using these APIs. But I summon Norbert at this point!

Previous discussion of allowing surrogate code points:
https://mail.mozilla.org/pipermail/es-discuss/2012-December/thread.html#27057
https://mail.mozilla.org/pipermail/es-discuss/2013-January/thread.html#28086
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/thread.html#29

Essentially, ECMAScript strings are Unicode strings as defined in The Unicode 
Standard section 2.7, and thus may contain unpaired surrogate code units in 
their 16-bit form or surrogate code points when interpreted as 32-bit 
sequences. String.fromCodePoint and String.prototype.codePointAt just convert 
between 16-bit and 32-bit forms; they're not meant to interpret the code points 
beyond that, and some processing (such as test cases) may depend on them being 
preserved. This is different from encoding for communication over networks, 
where the use of valid UTF-8 or UTF-16 (which cannot contain surrogate code 
points) is generally required.
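
To illustrate the point (this example is not from the original mail): codePointAt reads a full code point starting at a 16-bit code unit index, fromCodePoint writes one back out as code units, and an unpaired surrogate survives the round trip rather than being rejected.

```javascript
// U+1F4A9 is stored as the surrogate pair 0xD83D 0xDCA9.
const s = "\u{1F4A9}";
console.log(s.length);         // 2 -- two 16-bit code units
console.log(s.charCodeAt(0));  // 0xD83D -- just the lead surrogate
console.log(s.codePointAt(0)); // 0x1F4A9 -- the full code point
console.log(s.codePointAt(1)); // 0xDCA9 -- indexing mid-pair yields the trail surrogate

// fromCodePoint is the inverse conversion, 32-bit form back to 16-bit form.
console.log(String.fromCodePoint(0x1F4A9) === s); // true

// An unpaired surrogate code point is preserved, not treated as an error.
const lone = "\uD83D";
console.log(lone.codePointAt(0));                   // 0xD83D
console.log(String.fromCodePoint(0xD83D) === lone); // true
```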

The indexing issue was first discussed in the form "why can't we just use 
UTF-32?" See
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#UTF32
for pointers to that. It would have been great to use UTF-8, but it's 
unfortunately not compatible with the past and the DOM.

Adding code point indexing to 16-bit code unit strings would add significant 
performance overhead. In reality, whether an index is for 16-bit or 32-bit 
units matters only for some relatively low-level software that needs to process 
code point by code point. A lot of software deals with complete strings without 
ever looking inside, or is fine processing code unit by code unit (e.g., 
String.prototype.indexOf).
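
A small sketch of that trade-off (the codeUnitIndex helper is hypothetical, not part of any proposal): index-based methods count code units, while the string iterator walks code points, so translating a code point index into a code unit index requires a linear scan from the start of the string.

```javascript
const s = "a\u{1F4A9}b"; // 3 code points, 4 code units

// Index-based methods count 16-bit code units:
console.log(s.length);       // 4
console.log(s.indexOf("b")); // 3 -- a code unit index

// The string iterator yields whole code points:
let count = 0;
for (const cp of s) count++; // "a", "\u{1F4A9}", "b"
console.log(count);          // 3

// Hypothetical helper: finding the code unit index of the n-th code
// point takes a scan from the start -- the overhead mentioned above.
function codeUnitIndex(str, codePointIndex) {
  let i = 0;
  for (const cp of str) {
    if (codePointIndex-- === 0) return i;
    i += cp.length; // each code point is 1 or 2 code units
  }
  return -1;
}
console.log(codeUnitIndex(s, 2)); // 3 -- "b" starts at code unit 3
```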

Norbert
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
