On Feb 21, 2012, at 6:05 PM, Norbert Lindenberg <ecmascr...@norbertlindenberg.com> wrote:
> I'll reply to Brendan's proposal in two parts: first about the goals for
> supplementary character support, second about the BRS.
>
>> Full 21-bit Unicode support means all of:
>>
>> * indexing by characters, not uint16 storage units;
>> * counting length as one greater than the last index; and
>> * supporting escapes with (up to) six hexadecimal digits.
>
> For me, full 21-bit Unicode support has a different priority list.
>
> First come the essentials: Regular expressions; functions that interpret
> strings; the overall sense that all Unicode characters are supported.
>
> 1) Regular expressions must recognize supplementary characters as atomic
> entities, and interpret them according to Unicode semantics.

Sorry to have been unclear. In my proposal this follows from the first two
bullets.

> 2) Built-in functions that interpret strings have to recognize supplementary
> characters as atomic entities and interpret them according to their Unicode
> semantics.

Ditto.

> 3) It must be clear that the full Unicode character set is allowed and
> supported.

Absolutely.

> Only after these essentials come the niceties of String representation and
> Unicode escapes:
>
> 4) 1 String element to 1 Unicode code point is indeed a very nice and
> desirable relationship. Unlike Java, where binary compatibility between
> virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript
> needs to be compatible only at the source code level - or maybe, with a BRS,
> not even that.

Right!

> 5) If we don't go for UTF-32, then there should be a few functions to
> simplify access to strings in terms of code points, such as
> String.fromCodePoint, String.prototype.codePointAt.

Those would help smooth out different BRS settings, indeed.

> 6) I strongly prefer the use of plain characters over Unicode escapes in
> source code, because plain text is much easier to read than sequences of hex
> values. However, the need for Unicode escapes is greater in the space of
> supplementary characters because here we often have to reference characters
> for which our operating systems don't have glyphs yet. And \u{1D11E}
> certainly makes it easier to cross-reference a character than \uD834\uDD1E.
> The new escape syntax therefore should be on the list, at low priority.

Allen and I were just discussing this as a desirable mini-strawman of its own,
which Allen will write up for consideration at the next meeting. We will also
discuss the BRS. Did you have some thoughts on it?

> I think it would help if other people involved in this discussion also
> clarified what exactly their requirements are for "full Unicode support".

Again, apologies for not being explicit. I model the string methods as
self-hosted using indexing and .length in straightforward ways.

HTH,

/be

> Norbert
>
> On Feb 19, 2012, at 0:33, Brendan Eich wrote:
>
>> Once more unto the breach, dear friends!
>>
>> ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had
>> pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say
>> ;-).
>>
>> Clearly that was a while ago. These days, we would like full 21-bit Unicode
>> character support in JS. Some (mranney at Voxer) contend that it is a
>> requirement.
>>
>> Full 21-bit Unicode support means all of:
>>
>> * indexing by characters, not uint16 storage units;
>> * counting length as one greater than the last index; and
>> * supporting escapes with (up to) six hexadecimal digits.
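(As a concrete illustration of those three bullets -- a sketch only, assuming
the BRS described below is thrown to full Unicode; these are not spec
assertions:)

var face = "\u{1F638}";           // escape with more than four hex digits
face.length == 1;                 // one greater than the last (character) index, 0
face.charCodeAt(0) == 0x1f638;    // indexed access yields a code point, not a surrogate half

Compare Allen's example below, which shows today's two-uint16-unit view of the
same character.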
>>
>> ES4 saw bold proposals including Lars Hansen's, to allow implementations to
>> change string indexing and length incompatibly, and let Darwin sort it out.
>> I recall that was when we agreed to support "\u{XXXXXX}" as an extension for
>> spelling non-BMP characters.
>>
>> Allen's strawman from last year,
>> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings,
>> proposed a brute-force change to support full Unicode (albeit with too many
>> hex digits allowed in "\u{...}"), observing that "There are very few places
>> where the ECMAScript specification has actual dependencies upon the size of
>> individual characters so the compatibility impact of supporting full Unicode
>> is quite small." But two problems remained:
>>
>> P1. As Allen wrote, "There is a larger impact on actual implementations",
>> and no implementors that I can recall were satisfied that the cost was
>> acceptable. It might be, we just didn't know, and there are enough signs of
>> high cost to create this concern.
>>
>> P2. The change is not backward compatible. In JS today, one reads a string s
>> from somewhere and hard-codes, e.g., s.indexOf("\ud800") to find part of a
>> surrogate pair, then advances to the next-indexed uint16 unit and reads the
>> other half, then combines to compute some result. Such usage would break.
>>
>> Example from Allen:
>>
>> var c = "😸" // where the single character between the quotes is the Unicode
>> character U+1F638
>>
>> c.length == 2;
>> c === "\ud83d\ude38"; // the two-unit UTF-16 encoding of 0x1f638
>> c.charCodeAt(0) == 0xd83d;
>> c.charCodeAt(1) == 0xde38;
>>
>> (Allen points out how browsers, node.js, and other environments blindly
>> handle UTF-8 or whatever incoming format, recoding to UTF-16 upstream of the
>> JS engine, so the above actually works without any spec language in ECMA-262
>> saying it should.)
>>
>> So based on a recent twitter/github exchange, gist recorded at
>> https://gist.github.com/1850768, I would like to propose a variation on
>> Allen's proposal that resolves both of these problems. Here are resolutions
>> in reverse order:
>>
>> R2. No incompatible change without opt-in. If you hardcode as in Allen's
>> example, don't opt in without changing your index, length, and char/code-at
>> assumptions.
>>
>> Such opt-in cannot be a pragma, since pragmas have lexical scope and affect
>> code, not the heap where strings and String.prototype methods live.
>>
>> We also wish to avoid exposing a "full Unicode" representation type and a
>> duplicated suite of the String static and prototype methods, as Java did.
>> (We may well want UTF-N transcoding helpers; we certainly want ByteArray <->
>> UTF-8 transcoding APIs.)
>>
>> True, R2 implies there are two string primitive representations at most, or
>> more likely "1.x" for some fraction .x. Say, a flag bit in the string header
>> to distinguish JS's uint16-based indexing ("UCS-2") from non-O(1)-indexing
>> UTF-16. Lots of non-observable implementation options here.
>>
>> Instead of any such *big* new observables, I propose a so-called "Big Red
>> [opt-in] Switch" (BRS) on the side of a unit of VM isolation: specifically
>> the global object.
>>
>> Why the global object? Because for many VMs, each global has its own heap or
>> sub-heap ("compartment"), and all references outside that heap are to local
>> proxies that copy from, or in the case of immutable data, reference the
>> remote heap. Also because inter-compartment traffic is (we conjecture)
>> infrequent enough to tolerate the proxy/copy overhead.
>>
>> For strings and String objects, such proxies would consult the remote heap's
>> BRS setting and transcode indexed access, and .length gets, accordingly. It
>> doesn't matter if the BRS is in the global or its String constructor or
>> String.prototype, as the latter are unforgeably linked to the global.
>>
>> This means a script intent on comparing strings from two globals with
>> different BRS settings could indeed tell that one discloses non-BMP
>> char/codes, e.g. charCodeAt return values >= 0x10000. This is the *small*
>> new observable I claim we can live with, because someone opted into it at
>> least in one of the related global objects.
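(To make that small observable concrete -- an illustrative sketch, not spec
text: suppose s holds the single character U+1F638 and was minted in a BRS-on
global, while t holds the same character as minted in a BRS-off global:)

// Full-Unicode (BRS on) view:
s.length == 1;
s.charCodeAt(0) == 0x1f638;    // >= 0x10000: the small new observable

// Today's (BRS off) view of the same character:
t.length == 2;
t.charCodeAt(0) == 0xd83d;
t.charCodeAt(1) == 0xde38;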
>>
>> Note that implementations such as Node.js can pre-set the BRS to "full
>> Unicode" at startup. Embeddings that fully isolate each global and its
>> reachable objects and strings pay no string-proxy or -copy overhead.
>>
>> R1. To keep compatibility with DOM APIs, the DOM glue used to mediate calls
>> from JS to (typically) C++ would have to proxy or copy any strings
>> containing non-BMP characters. Strings with only BMP characters would work
>> as today.
>>
>> Note that we are dealing only in spec observables here. It doesn't matter
>> whether the JS engine uses UTF-8 and the DOM UCS-2 (in which case there is
>> already a transcoding penalty; IIRC WebKit's libxml and libxslt use UTF-8
>> and so must transcode to interface with WebKit's DOM). The only issue at
>> this boundary, I believe, is how indexing and .length work.
>>
>> OK, there you have it: resolutions for both problems that killed the last
>> assault on Castle '90s-JS.
>>
>> Implementations that use uint16 vectors as the character data representation
>> type for both "UCS-2" and "UTF-16" string variants would probably want
>> another flag bit per string header indicating whether, for the UTF-16 case,
>> the string indeed contains any non-BMP characters. If not, no proxy/copy is
>> needed.
>>
>> Such implementations probably would benefit from string (primitive value)
>> proxies, not just copies, since the underlying uint16 vector could be shared
>> by two different string headers with whatever metadata flag bits, etc., are
>> needed to disclose different length values, access different methods from
>> distinct globals' String.prototype objects, etc.
>>
>> We could certainly also work with the W3C to revise the DOM to check the BRS
>> setting, if that is possible, to avoid this non-BMP-string proxy/copy
>> overhead.
>>
>> How is the BRS configured? Again, not via a pragma, and not by imperative
>> state update inside the language (mutating hidden BRS state at a given
>> program point could leave strings created before the mutation observably
>> different from those created after, unless the implementation in effect
>> scanned the local heap and wrapped or copied any non-BMP-char-bearing ones
>> created before).
>>
>> The obvious way to express the BRS in HTML is a <meta> tag in document
>> <head>, but I don't want to get hung up on this point. I do welcome expert
>> guidance. Here is another W3C/WHATWG interaction point. For this reason I'm
>> cc'ing public-script-coord.
>>
>> The upshot of this proposal is to get JS out of the '90s without a mandatory
>> breaking change. With simple-enough opt-in expressed at coarse-enough
>> boundaries so as not to impose high cost or unintended string type confusion
>> bugs, the complexity is mostly borne by implementors, and at less than a 2x
>> cost comparing string implementations (I think -- demonstration required of
>> course).
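(To make the "complexity is mostly borne by implementors" claim concrete -- an
illustrative sketch, not part of the proposal: code that uses only indexing
and .length, without hard-coded surrogate math as in Allen's example, keeps
its shape after the switch; it simply sees characters instead of uint16
storage units:)

// Count occurrences of the single-character string ch in str.
function countChar(str, ch) {
  var n = 0;
  for (var i = 0; i < str.length; i++) {
    if (str.charAt(i) === ch) n++;
  }
  return n;
}

countChar("a\u{1F638}b\u{1F638}c", "\u{1F638}") == 2;  // with the BRS on, each cat face is one element

With the BRS off, the same source would have to spell the cat face as a
surrogate pair, and the loop would visit each half separately.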
>>
>> In particular, Node.js can get modern at startup, and perhaps engines such
>> as V8 as used in Node could even support compile-time (#ifdef) configury by
>> which to support only full Unicode.
>>
>> Comments welcome.
>>
>> /be