Thanks for the reminders -- we've been over this.

/be

Norbert Lindenberg <mailto:[email protected]>
September 5, 2013 12:07 PM

Previous discussion of allowing surrogate code points:
https://mail.mozilla.org/pipermail/es-discuss/2012-December/thread.html#27057
https://mail.mozilla.org/pipermail/es-discuss/2013-January/thread.html#28086
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/thread.html#29

Essentially, ECMAScript strings are Unicode strings as defined in The Unicode Standard section 2.7, and thus may contain unpaired surrogate code units in their 16-bit form or surrogate code points when interpreted as 32-bit sequences. String.fromCodePoint and String.prototype.codePointAt just convert between 16-bit and 32-bit forms; they're not meant to interpret the code points beyond that, and some processing (such as test cases) may depend on them being preserved. This is different from encoding for communication over networks, where the use of valid UTF-8 or UTF-16 (which cannot contain surrogate code points) is generally required.
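Concretely (a sketch assuming an ES6-level engine), the two functions round-trip lone surrogates rather than rejecting or replacing them:

```javascript
// A lone surrogate survives the 16-bit <-> 32-bit round trip unchanged.
const lone = String.fromCodePoint(0xdfff); // "\udfff", one code unit
console.log(lone.length);                      // 1
console.log(lone.codePointAt(0).toString(16)); // "dfff"

// A supplementary code point becomes a surrogate pair of code units.
const s = String.fromCodePoint(0x1f600);       // U+1F600
console.log(s.length);                         // 2 (two code units)
console.log(s.codePointAt(0) === 0x1f600);     // true
```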

The indexing issue was first discussed in the form "why can't we just use UTF-32?" See http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#UTF32 for pointers to that. It would have been great to use UTF-8, but it's unfortunately not compatible with the past and the DOM.

Adding code point indexing to 16-bit code unit strings would add significant performance overhead. In reality, whether an index is for 16-bit or 32-bit units matters only for some relatively low-level software that needs to process code point by code point. A lot of software deals with complete strings without ever looking inside, or is fine processing code unit by code unit (e.g., String.prototype.indexOf).
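For example, code-unit-level processing such as String.prototype.indexOf needs no code point awareness, while code-point-by-code-point processing has to step over surrogate pairs manually (sketch, ES6 assumed):

```javascript
// Code-unit-level search works without knowing about code points:
const text = "naïve 😀 test";
console.log(text.indexOf("test")); // 9 (a code unit index)

// Low-level, code-point-by-code-point processing must step over surrogate pairs:
const cps = [];
for (let i = 0; i < text.length; ) {
  const cp = text.codePointAt(i);
  cps.push(cp);
  i += cp > 0xffff ? 2 : 1; // supplementary code points occupy two code units
}
console.log(cps.length); // 12 code points in 13 code units
```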

Norbert
Brendan Eich <mailto:[email protected]>
September 4, 2013 2:28 PM


8. Let first be the code unit value of the element at index position in the String S.

This does not "[pass] through to charCodeAt()" literally, which would mean a call to S.charCodeAt(position). I thought that's what you meant.

So you want a code point index, not a code unit index. That would not be useful for the lower-level purposes Allen identified. Again it seems you're trying to abstract away from all the details that probably will matter for string hackers using these APIs. But I summon Norbert at this point!
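To illustrate the distinction: codePointAt takes a code unit index, so indexing into the middle of a surrogate pair exposes the trailing surrogate (sketch, ES6 assumed):

```javascript
const s = "😀!"; // U+1F600 (two code units) followed by "!"
console.log(s.codePointAt(0).toString(16)); // "1f600" — whole code point
console.log(s.codePointAt(1).toString(16)); // "de00"  — the trailing surrogate, not a character
console.log(s.codePointAt(2));              // 33, i.e. "!"
```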

/be
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

Anne van Kesteren <mailto:[email protected]>
September 4, 2013 12:51 PM
On Wed, Sep 4, 2013 at 5:34 PM, Brendan Eich <[email protected]> wrote:
Because of String.fromCharCode precedent. Balanced names with noun phrases that distinguish the "from" domains are better than longAndPortly vs. tiny.

I kinda liked it as analogue to what exists for Array and because
developers should probably move away from fromCharCode so the
precedent does not matter that much.


Sure, but you wanted to reduce "three concepts" and I don't see how to do
that. Most developers can ignore UTF-8, for sure.

The three concepts are: 16-bit code units, code points, and Unicode
scalar values. JavaScript, DOM, etc. deal with 16-bit code units.
utf-8 et al deal with Unicode scalar values. Nothing, apart from this
API, does code points at the moment.
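A sketch of the three concepts side by side, using the Encoding Standard's TextEncoder (which deals only in scalar values; assumes a modern engine where TextEncoder is available):

```javascript
const s = "\udfff"; // a lone surrogate: a legal ECMAScript string, but not a scalar value
console.log(s.charCodeAt(0).toString(16));  // "dfff" — 16-bit code unit
console.log(s.codePointAt(0).toString(16)); // "dfff" — surrogate code point, passed through
// A utf-8 encoder deals only in scalar values; the lone surrogate comes out as U+FFFD:
const bytes = new TextEncoder().encode(s);
console.log(bytes); // the three bytes 0xEF 0xBF 0xBD, i.e. utf-8 for U+FFFD
```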


Probably I just misunderstood what you meant, and you were simply pointing
out that lone surrogates arise only from legacy APIs?

No, they arise from this API.


Here, from the latest ES6 draft, is 15.5.2.3 String.fromCodePoint (
...codePoints):

No exposed surrogates here!

Mathias covered this.


Here's the spec for String.prototype.codePointAt:

8. Let first be the code unit value of the element at index position in the
String S.
11. If second < 0xDC00 or second > 0xDFFF, then return first.

I take it you are objecting to step 11?

And step 8. The indexing is based on code units so you cannot actually
do indexing easily. You'd need to use the iterator to iterate over a
string getting only code points out.
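The ES6 string iterator does exactly that: it advances by code point rather than by code unit (sketch):

```javascript
const s = "a😀b";
console.log(s.length);      // 4 — code units
const points = [...s];      // the string iterator yields whole code points
console.log(points.length); // 3 — "a", "😀", "b"
for (const ch of s) console.log(ch.codePointAt(0).toString(16)); // "61", "1f600", "62"
```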


The indexing of codePointAt() is also kind of sad as it just passes
through to charCodeAt(),

I don't see that in the spec cited above.

How do you read step 8?


Brendan Eich <mailto:[email protected]>
September 4, 2013 9:34 AM
Anne van Kesteren <mailto:[email protected]>
September 4, 2013 9:06 AM
On Wed, Sep 4, 2013 at 4:58 PM, Brendan Eich <[email protected]> wrote:
String.fromCodePoint, rather.

Oops. Any reason this is not just String.from() btw? Give the better
method a nice short name?

Because of String.fromCharCode precedent. Balanced names with noun phrases that distinguish the "from" domains are better than longAndPortly vs. tiny.


I'm not sure I'm a big fan of having all three concepts around.
You can't avoid it: UTF-8 is a transfer format that can be observed via
serialization.

Yes, but it cannot encode lone surrogates. It can only deal in Unicode
scalar values.

Sure, but you wanted to reduce "three concepts" and I don't see how to do that. Most developers can ignore UTF-8, for sure.

Probably I just misunderstood what you meant, and you were simply pointing out that lone surrogates arise only from legacy APIs?


String.prototype.charCodeAt and String.fromCharCode are
required for backward compatibility. And ES6 wants to expose code points as
well, so three.

Unicode scalar values are code points sans surrogates, i.e. completely
compatible with what a utf-8 encoder/decoder pair can handle.

Why do you want to expose surrogates?

I'm not sure I do! Sounds scandalous. :-P

Here, from the latest ES6 draft, is 15.5.2.3 String.fromCodePoint ( ...codePoints):

The String.fromCodePoint function may be called with a variable number of arguments which form the
rest parameter codePoints. The following steps are taken:
1. Assert: codePoints is a well-formed rest parameter object.
2. Let length be the result of Get(codePoints, "length").
3. Let elements be a new List.
4. Let nextIndex be 0.
5. Repeat while nextIndex < length
a. Let next be the result of Get(codePoints, ToString(nextIndex)).
b. Let nextCP be ToNumber(next).
c. ReturnIfAbrupt(nextCP).
d. If SameValue(nextCP, ToInteger(nextCP)) is false, then throw a RangeError exception.
e. If nextCP < 0 or nextCP > 0x10FFFF, then throw a RangeError exception.
f. Append the elements of the UTF-16 Encoding (clause 6) of nextCP to the end of elements.
g. Let nextIndex be nextIndex + 1.
6. Return the String value whose elements are, in order, the elements in the List elements. If length is 0, the
empty string is returned.


No exposed surrogates here!
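A rough JavaScript rendering of the quoted algorithm (an illustrative sketch, not the normative spec text; in particular ToInteger is only approximated):

```javascript
function fromCodePoint(...codePoints) {
  const units = [];
  for (const next of codePoints) {
    const cp = Number(next);
    if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) {
      throw new RangeError("Invalid code point: " + next); // steps 5.d-e
    }
    if (cp <= 0xffff) {
      units.push(cp); // BMP code point: one code unit — surrogates pass straight through
    } else {
      const offset = cp - 0x10000; // UTF-16 encoding, step 5.f
      units.push(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
    }
  }
  return String.fromCharCode(...units);
}

console.log(fromCodePoint(0x1f600) === String.fromCodePoint(0x1f600)); // true
console.log(fromCodePoint(0xdfff).length); // 1 — the lone surrogate is not rejected
```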

Here's the spec for String.prototype.codePointAt:

When the codePointAt method is called with one argument pos, the following steps are taken:
1. Let O be CheckObjectCoercible(this value).
2. Let S be ToString(O).
3. ReturnIfAbrupt(S).
4. Let position be ToInteger(pos).
5. ReturnIfAbrupt(position).
6. Let size be the number of elements in S.
7. If position < 0 or position ≥ size, return undefined.
8. Let first be the code unit value of the element at index position in the String S.
9. If first < 0xD800 or first > 0xDBFF or position+1 = size, then return first.
10. Let second be the code unit value of the element at index position+1 in the String S.
11. If second < 0xDC00 or second > 0xDFFF, then return first.
12. Return ((first – 0xD800) × 1024) + (second – 0xDC00) + 0x10000.
NOTE The codePointAt function is intentionally generic; it does not require that its this value be a String object. Therefore it can be transferred to other kinds of objects for use as a method.
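Roughly, in JavaScript (an illustrative sketch of the quoted steps, not the normative text; ToInteger is only approximated):

```javascript
function codePointAt(s, pos) {
  const position = Math.trunc(pos) || 0; // step 4, approximately
  const size = s.length;
  if (position < 0 || position >= size) return undefined; // step 7
  const first = s.charCodeAt(position); // step 8: a code unit index
  if (first < 0xd800 || first > 0xdbff || position + 1 === size) return first; // step 9
  const second = s.charCodeAt(position + 1); // step 10
  if (second < 0xdc00 || second > 0xdfff) return first; // step 11: lone lead surrogate
  return (first - 0xd800) * 1024 + (second - 0xdc00) + 0x10000; // step 12
}

console.log(codePointAt("😀", 0).toString(16));     // "1f600"
console.log(codePointAt("\ud800x", 0).toString(16)); // "d800" — the surrogate is exposed
```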


I take it you are objecting to step 11?


Sorry, I missed this: how else (other than the charCodeAt/fromCharCode
legacy) are lone surrogates exposed?

"\udfff".codePointAt(0) == 0xDFFF

It seems better if that returns "\ufffd", as you'd get with utf-8
(assuming it accepts code points as input rather than just Unicode
scalar values, in which case it'd throw).

Maybe. Allen and Norbert should weigh in.
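A sketch of the proposed replacement behavior, as a hypothetical unicodeAt helper built on codePointAt (the name and shape are assumptions, following the suggestion earlier in this thread):

```javascript
// Hypothetical: like codePointAt, but lone surrogates come back as U+FFFD.
function unicodeAt(s, pos) {
  const cp = s.codePointAt(pos);
  if (cp === undefined) return undefined;
  // Surrogate code points (U+D800–U+DFFF) are not Unicode scalar values; replace them.
  return cp >= 0xd800 && cp <= 0xdfff ? 0xfffd : cp;
}

console.log(unicodeAt("\udfff", 0).toString(16)); // "fffd"
console.log(unicodeAt("😀", 0).toString(16));     // "1f600"
```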

The indexing of codePointAt() is also kind of sad as it just passes
through to charCodeAt(),

I don't see that in the spec cited above.

/be

which means for any serious usage you need to
use the iterator anyway. What's the reason codePointAt() exists?


Brendan Eich <mailto:[email protected]>
September 4, 2013 8:58 AM


Anne van Kesteren <mailto:[email protected]>
September 4, 2013 7:48 AM
ES6 introduces String.prototype.codePointAt() and
String.codePointFrom()

String.fromCodePoint, rather.

as well as an iterator (not defined). It struck
me this is the only place in the platform where we'd expose code point
as a concept to developers.

Nowadays strings are either 16-bit code units (JavaScript, DOM, etc.)
or Unicode scalar values (anytime you hit the network and use utf-8).

I'm not sure I'm a big fan of having all three concepts around.

You can't avoid it: UTF-8 is a transfer format that can be observed via serialization. String.prototype.charCodeAt and String.fromCharCode are required for backward compatibility. And ES6 wants to expose code points as well, so three.

We
could have String.prototype.unicodeAt() and String.unicodeFrom()
instead, and have them translate lone surrogates into U+FFFD. Lone
surrogates are a bug and I don't see a reason to expose them in more
places than just the 16-bit code units.

Sorry, I missed this: how else (other than the charCodeAt/fromCharCode legacy) are lone surrogates exposed?

/be

