Anne van Kesteren <mailto:[email protected]>
September 11, 2013 3:43 AM

It's not clear the arguments were carefully considered, though. Shawn
Steele raised the same concerns I did. The unicode.org thread also
suggests that the ideal value space for a string is Unicode scalar
values (i.e. what utf-8 can encode), not code points. It did indicate
that Unicode strings contain code points because of legacy, but
JavaScript has 16-bit code units due to legacy. If we're going to
offer a higher level of abstraction over the basic string type, we
could very well make that a utf-8 safe layer.
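A minimal sketch of what such a utf-8 safe layer might check, on top of
today's 16-bit code units (isWellFormed is a hypothetical helper name
here, not a proposed API):

    // Returns true if every code point in s is a Unicode scalar value,
    // i.e. s contains no lone surrogates and could round-trip through
    // well-formed UTF-8.
    function isWellFormed(s) {
      for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        if (c >= 0xD800 && c <= 0xDBFF) {         // high surrogate
          var d = s.charCodeAt(i + 1);            // NaN past the end
          if (!(d >= 0xDC00 && d <= 0xDFFF))
            return false;                         // unpaired high surrogate
          i++;                                    // skip the low half of the pair
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
          return false;                           // lone low surrogate
        }
      }
      return true;
    }

    isWellFormed("a\uD83D\uDE00b");  // true  -- proper surrogate pair
    isWellFormed("a\uD83Db");        // false -- unpaired high surrogate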

You could be right, but this is a deep topic, not sorted out by programming language developers, in my view. It came up recently here:

http://www.haskell.org/pipermail/haskell-cafe/2013-September/108654.html

That thread continues. The point about C winning because it doesn't have an abstract String type, only char[], is persuasive in my view. Yes, it's low level and you have to cope with multiple encodings, but any attempt at a more abstract view would have produced a badly leaky abstraction, which would have been more of a boat anchor.

/be

If you need anything for tests, you can just ignore the higher level
of abstraction and operate on 16-bit code units instead.


Brendan Eich <mailto:[email protected]>
September 5, 2013 2:08 PM
Thanks for the reminders -- we've been over this.

/be

Norbert Lindenberg <mailto:[email protected]>
September 5, 2013 12:07 PM

Previous discussion of allowing surrogate code points:
https://mail.mozilla.org/pipermail/es-discuss/2012-December/thread.html#27057
https://mail.mozilla.org/pipermail/es-discuss/2013-January/thread.html#28086
http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/thread.html#29

Essentially, ECMAScript strings are Unicode strings as defined in The Unicode Standard section 2.7, and thus may contain unpaired surrogate code units in their 16-bit form or surrogate code points when interpreted as 32-bit sequences. String.fromCodePoint and String.prototype.codePointAt just convert between 16-bit and 32-bit forms; they're not meant to interpret the code points beyond that, and some processing (such as test cases) may depend on them being preserved. This is different from encoding for communication over networks, where the use of valid UTF-8 or UTF-16 (which cannot contain surrogate code points) is generally required.
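A small illustration of that preservation, assuming the draft semantics described above:

    // codePointAt and fromCodePoint convert between the 16-bit and 32-bit
    // views without rejecting surrogate code points:
    var cp = "\uD800".codePointAt(0);   // 0xD800 -- a surrogate code point
    var s  = String.fromCodePoint(cp);  // "\uD800" -- the lone surrogate survives
    s.length;                           // 1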

The indexing issue was first discussed in the form "why can't we just use UTF-32?" See
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#UTF32
for pointers to that. It would have been great to use UTF-8, but it's unfortunately not compatible with the past and the DOM.

Adding code point indexing to 16-bit code unit strings would add significant performance overhead. In reality, whether an index is for 16-bit or 32-bit units matters only for some relatively low-level software that needs to process code point by code point. A lot of software deals with complete strings without ever looking inside, or is fine processing code unit by code unit (e.g., String.prototype.indexOf).
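For that low-level case, something like the following sketch processes a string code point by code point without needing a code point index, advancing by one or two code units as appropriate (forEachCodePoint is an illustrative helper, not part of any proposal):

    function forEachCodePoint(s, f) {
      for (var i = 0; i < s.length; ) {
        var cp = s.codePointAt(i);       // draft ES6 API
        f(cp, i);                        // i is a code unit index
        i += cp > 0xFFFF ? 2 : 1;        // supplementary code points take two units
      }
    }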

Norbert
Brendan Eich <mailto:[email protected]>
September 4, 2013 2:28 PM


8. Let first be the code unit value of the element at index position in the String S.

This does not "[pass] through to charCodeAt()" literally, which would mean a call to S.charCodeAt(position). I thought that's what you meant.

So you want a code point index, not a code unit index. That would not be useful for the lower-level purposes Allen identified. Again it seems you're trying to abstract away from all the details that probably will matter for string hackers using these APIs. But I summon Norbert at this point!

/be

Anne van Kesteren <mailto:[email protected]>
September 4, 2013 12:51 PM
On Wed, Sep 4, 2013 at 5:34 PM, Brendan Eich <[email protected]> wrote:
Because of String.fromCharCode precedent. Balanced names with noun phrases
that distinguish the "from" domains are better than longAndPortly vs. tiny.

I kinda liked it as an analogue of what exists for Array, and because
developers should probably move away from fromCharCode the precedent
does not matter that much.


Sure, but you wanted to reduce "three concepts" and I don't see how to do
that. Most developers can ignore UTF-8, for sure.

The three concepts are: 16-bit code units, code points, and Unicode
scalar values. JavaScript, DOM, etc. deal with 16-bit code units.
utf-8 et al deal with Unicode scalar values. Nothing, apart from this
API, does code points at the moment.
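All three show up on a single string holding one supplementary
character, assuming the draft codePointAt:

    var s = "\uD83D\uDE00";          // U+1F600 as two 16-bit code units
    s.length;                        // 2       -- code units
    s.charCodeAt(0).toString(16);    // "d83d"  -- a 16-bit code unit
    s.codePointAt(0).toString(16);   // "1f600" -- a code point
    // Unicode scalar values are the code points minus the surrogate
    // range, so 0xD83D on its own is a code point but not a scalar value.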


Probably I just misunderstood what you meant, and you were simply pointing
out that lone surrogates arise only from legacy APIs?

No, they arise from this API.
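For example, assuming the draft semantics of fromCodePoint:

    String.fromCodePoint(0xD800);          // "\uD800" -- an unpaired surrogate
    String.fromCodePoint(0xD83D, 0xDE00);  // "\uD83D\uDE00" -- same as U+1F600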


Here, from the latest ES6 draft, is 15.5.2.3 String.fromCodePoint (...codePoints):

No exposed surrogates here!

Mathias covered this.


Here's the spec for String.prototype.codePointAt:

8. Let first be the code unit value of the element at index position in the
String S.
11. If second < 0xDC00 or second > 0xDFFF, then return first.

I take it you are objecting to step 11?

And step 8. The indexing is based on code units, so you cannot
actually index by code point easily. You'd need to use the iterator to
iterate over the string, getting only code points out.
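Something like this, using the draft ES6 string iterator (for-of),
which yields one whole code point per step:

    for (var ch of "a\uD83D\uDE00b") {
      console.log(ch, ch.codePointAt(0).toString(16));
    }
    // "a" 61, "\uD83D\uDE00" 1f600, "b" 62 -- three code points, four code units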


The indexing of codePointAt() is also kind of sad as it just passes
through to charCodeAt(),

I don't see that in the spec cited above.

How do you read step 8?


