Anne van Kesteren <mailto:[email protected]>
September 11, 2013 3:43 AM

It's not clear the arguments were carefully considered, though. Shawn
Steele raised the same concerns I did. The unicode.org thread also
suggests that the ideal value space for a string is Unicode scalar
values (i.e. what utf-8 can encode), not code points. It did indicate
that Unicode strings contain code points because of legacy, but
JavaScript has 16-bit code units due to legacy. If we're going to
offer a higher level of abstraction over the basic string type, we
could very well make that a utf-8 safe layer.
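A minimal sketch of what such a utf-8 safe layer might check, on top of
today's 16-bit code units (isWellFormed is a hypothetical helper name
here, not a proposed API):

    // Returns true if every code point in s is a Unicode scalar value,
    // i.e. s contains no lone surrogates and could round-trip through
    // well-formed UTF-8.
    function isWellFormed(s) {
      for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        if (c >= 0xD800 && c <= 0xDBFF) {         // high surrogate
          var d = s.charCodeAt(i + 1);            // NaN past the end
          if (!(d >= 0xDC00 && d <= 0xDFFF))
            return false;                         // unpaired high surrogate
          i++;                                    // skip the low half of the pair
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
          return false;                           // lone low surrogate
        }
      }
      return true;
    }

    isWellFormed("a\uD83D\uDE00b");  // true  -- proper surrogate pair
    isWellFormed("a\uD83Db");        // false -- unpaired high surrogate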

You could be right, but this is a deep topic, not sorted out by programming language developers, in my view. It came up recently here:

http://www.haskell.org/pipermail/haskell-cafe/2013-September/108654.html

That thread continues. The point about C winning because it doesn't have an abstract String type, only char[], is persuasive in my view. Yes, it's low level and you have to cope with multiple encodings, but any attempt at a more abstract view would have produced a badly leaky abstraction, which would have been more of a boat anchor.

/be

If you need anything for tests, you can just ignore the higher level
of abstraction and operate on 16-bit code units instead.


Brendan Eich <mailto:[email protected]>
September 5, 2013 2:08 PM
Thanks for the reminders -- we've been over this.

/be

Norbert Lindenberg <mailto:[email protected]>
September 5, 2013 12:07 PM

Previous discussion of allowing surrogate code points:
https://mail.mozilla.org/pipermail/es-discuss/2012-December/thread.html#27057
https://mail.mozilla.org/pipermail/es-discuss/2013-January/thread.html#28086
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/thread.html#29

Essentially, ECMAScript strings are Unicode strings as defined in The Unicode Standard section 2.7, and thus may contain unpaired surrogate code units in their 16-bit form or surrogate code points when interpreted as 32-bit sequences. String.fromCodePoint and String.prototype.codePointAt just convert between 16-bit and 32-bit forms; they're not meant to interpret the code points beyond that, and some processing (such as test cases) may depend on them being preserved. This is different from encoding for communication over networks, where the use of valid UTF-8 or UTF-16 (which cannot contain surrogate code points) is generally required.
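A small illustration of that preservation, assuming the draft semantics described above:

    // codePointAt and fromCodePoint convert between the 16-bit and 32-bit
    // views without rejecting surrogate code points:
    var cp = "\uD800".codePointAt(0);   // 0xD800 -- a surrogate code point
    var s  = String.fromCodePoint(cp);  // "\uD800" -- the lone surrogate survives
    s.length;                           // 1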

The indexing issue was first discussed in the form "why can't we just use UTF-32?" See
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#UTF32
for pointers to that. It would have been great to use UTF-8, but it's unfortunately not compatible with the past and the DOM.

Adding code point indexing to 16-bit code unit strings would add significant performance overhead. In reality, whether an index is for 16-bit or 32-bit units matters only for some relatively low-level software that needs to process code point by code point. A lot of software deals with complete strings without ever looking inside, or is fine processing code unit by code unit (e.g., String.prototype.indexOf).
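For that low-level case, something like the following sketch processes a string code point by code point without needing a code point index, advancing by one or two code units as appropriate (forEachCodePoint is an illustrative helper, not part of any proposal):

    function forEachCodePoint(s, f) {
      for (var i = 0; i < s.length; ) {
        var cp = s.codePointAt(i);       // draft ES6 API
        f(cp, i);                        // i is a code unit index
        i += cp > 0xFFFF ? 2 : 1;        // supplementary code points take two units
      }
    }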

Norbert
Brendan Eich <mailto:[email protected]>
September 4, 2013 2:28 PM


8. Let first be the code unit value of the element at index position in the String S.

This does not "[pass] through to charCodeAt()" literally, which would mean a call to S.charCodeAt(position). I thought that's what you meant.

So you want a code point index, not a code unit index. That would not be useful for the lower-level purposes Allen identified. Again it seems you're trying to abstract away from all the details that probably will matter for string hackers using these APIs. But I summon Norbert at this point!

/be

Anne van Kesteren <mailto:[email protected]>
September 4, 2013 12:51 PM
On Wed, Sep 4, 2013 at 5:34 PM, Brendan Eich <[email protected]> wrote:
Because of String.fromCharCode precedent. Balanced names with noun phrases
that distinguish the "from" domains are better than longAndPortly vs. tiny.

I kinda liked it as an analogue of what exists for Array, and because
developers should probably move away from fromCharCode the precedent
does not matter that much.


Sure, but you wanted to reduce "three concepts" and I don't see how to do
that. Most developers can ignore UTF-8, for sure.

The three concepts are: 16-bit code units, code points, and Unicode
scalar values. JavaScript, DOM, etc. deal with 16-bit code units.
utf-8 et al deal with Unicode scalar values. Nothing, apart from this
API, does code points at the moment.
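All three show up on a single string holding one supplementary
character, assuming the draft codePointAt:

    var s = "\uD83D\uDE00";          // U+1F600 as two 16-bit code units
    s.length;                        // 2       -- code units
    s.charCodeAt(0).toString(16);    // "d83d"  -- a 16-bit code unit
    s.codePointAt(0).toString(16);   // "1f600" -- a code point
    // Unicode scalar values are the code points minus the surrogate
    // range, so 0xD83D on its own is a code point but not a scalar value.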


Probably I just misunderstood what you meant, and you were simply pointing
out that lone surrogates arise only from legacy APIs?

No, they arise from this API.
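For example, assuming the draft semantics of fromCodePoint:

    String.fromCodePoint(0xD800);          // "\uD800" -- an unpaired surrogate
    String.fromCodePoint(0xD83D, 0xDE00);  // "\uD83D\uDE00" -- same as U+1F600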


Here, from the latest ES6 draft, is 15.5.2.3 String.fromCodePoint (...codePoints):

No exposed surrogates here!

Mathias covered this.


Here's the spec for String.prototype.codePointAt:

8. Let first be the code unit value of the element at index position in the
String S.
11. If second < 0xDC00 or second > 0xDFFF, then return first.

I take it you are objecting to step 11?

And step 8. The indexing is based on code units, so you cannot
actually index by code point easily. You'd need to use the iterator to
iterate over the string, getting only code points out.
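Something like this, using the draft ES6 string iterator (for-of),
which yields one whole code point per step:

    for (var ch of "a\uD83D\uDE00b") {
      console.log(ch, ch.codePointAt(0).toString(16));
    }
    // "a" 61, "\uD83D\uDE00" 1f600, "b" 62 -- three code points, four code units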


The indexing of codePointAt() is also kind of sad as it just passes
through to charCodeAt(),

I don't see that in the spec cited above.

How do you read step 8?


