On Feb 21, 2012, at 6:05 PM, Norbert Lindenberg <ecmascr...@norbertlindenberg.com> wrote:
> I'll reply to Brendan's proposal in two parts: first about the goals for
> supplementary character support, second about the BRS.
>
>> Full 21-bit Unicode support means all of:
>>
>> * indexing by characters, not uint16 storage units;
>> * counting length as one greater than the last index; and
>> * supporting escapes with (up to) six hexadecimal digits.
>
> For me, full 21-bit Unicode support has a different priority list.
>
> First come the essentials: Regular expressions; functions that interpret
> strings; the overall sense that all Unicode characters are supported.
>
> 1) Regular expressions must recognize supplementary characters as atomic
> entities, and interpret them according to Unicode semantics.

Sorry to have been unclear. In my proposal this follows from the first two
bullets.

> 2) Built-in functions that interpret strings have to recognize supplementary
> characters as atomic entities and interpret them according to their Unicode
> semantics.

Ditto.

> 3) It must be clear that the full Unicode character set is allowed and
> supported.

Absolutely.

> Only after these essentials come the niceties of String representation and
> Unicode escapes:
>
> 4) 1 String element to 1 Unicode code point is indeed a very nice and
> desirable relationship. Unlike Java, where binary compatibility between
> virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript
> needs to be compatible only at the source code level - or maybe, with a BRS,
> not even that.

Right!

> 5) If we don't go for UTF-32, then there should be a few functions to
> simplify access to strings in terms of code points, such as
> String.fromCodePoint, String.prototype.codePointAt.

Those would help smooth out different BRS settings, indeed.

> 6) I strongly prefer the use of plain characters over Unicode escapes in
> source code, because plain text is much easier to read than sequences of hex
> values. However, the need for Unicode escapes is greater in the space of
> supplementary characters because here we often have to reference characters
> for which our operating systems don't have glyphs yet. And \u{1D11E}
> certainly makes it easier to cross-reference a character than \uD834\uDD1E.
> The new escape syntax therefore should be on the list, at low priority.

Allen and I were just discussing this as a desirable mini-strawman of its own,
which Allen will write up for consideration at the next meeting. We will also
discuss the BRS. Did you have some thoughts on it?

> I think it would help if other people involved in this discussion also
> clarified what exactly their requirements are for "full Unicode support".

Again, apologies for not being explicit. I model the string methods as
self-hosted using indexing and .length in straightforward ways.

HTH,

/be

> Norbert
>
> On Feb 19, 2012, at 0:33, Brendan Eich wrote:
>
>> Once more unto the breach, dear friends!
>>
>> ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had
>> pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say
>> ;-).
>>
>> Clearly that was a while ago. These days, we would like full 21-bit Unicode
>> character support in JS. Some (mranney at Voxer) contend that it is a
>> requirement.
>>
>> Full 21-bit Unicode support means all of:
>>
>> * indexing by characters, not uint16 storage units;
>> * counting length as one greater than the last index; and
>> * supporting escapes with (up to) six hexadecimal digits.
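(As a concrete illustration of those three bullets -- a sketch only, assuming
the BRS described below is thrown to full Unicode; these are not spec
assertions:)

var face = "\u{1F638}";           // escape with more than four hex digits
face.length == 1;                 // one greater than the last (character) index, 0
face.charCodeAt(0) == 0x1f638;    // indexed access yields a code point, not a surrogate half

Compare Allen's example below, which shows today's two-uint16-unit view of the
same character.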
>>
>> ES4 saw bold proposals including Lars Hansen's, to allow implementations to
>> change string indexing and length incompatibly, and let Darwin sort it out.
>> I recall that was when we agreed to support "\u{XXXXXX}" as an extension for
>> spelling non-BMP characters.
>>
>> Allen's strawman from last year,
>> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings,
>> proposed a brute-force change to support full Unicode (albeit with too many
>> hex digits allowed in "\u{...}"), observing that "There are very few places
>> where the ECMAScript specification has actual dependencies upon the size of
>> individual characters so the compatibility impact of supporting full Unicode
>> is quite small." But two problems remained:
>>
>> P1. As Allen wrote, "There is a larger impact on actual implementations",
>> and no implementors that I can recall were satisfied that the cost was
>> acceptable. It might be, we just didn't know, and there are enough signs of
>> high cost to create this concern.
>>
>> P2. The change is not backward compatible. In JS today, one reads a string s
>> from somewhere and hard-codes, e.g., s.indexOf("\ud800") to find part of a
>> surrogate pair, then advances to the next-indexed uint16 unit and reads the
>> other half, then combines to compute some result. Such usage would break.
>>
>> Example from Allen:
>>
>> var c = "😸" // where the single character between the quotes is the Unicode
>> character U+1F638
>>
>> c.length == 2;
>> c === "\ud83d\ude38"; // the two-unit UTF-16 encoding of 0x1f638
>> c.charCodeAt(0) == 0xd83d;
>> c.charCodeAt(1) == 0xde38;
>>
>> (Allen points out how browsers, node.js, and other environments blindly
>> handle UTF-8 or whatever incoming format, recoding to UTF-16 upstream of the
>> JS engine, so the above actually works without any spec language in ECMA-262
>> saying it should.)
>>
>> So based on a recent twitter/github exchange, gist recorded at
>> https://gist.github.com/1850768, I would like to propose a variation on
>> Allen's proposal that resolves both of these problems. Here are resolutions
>> in reverse order:
>>
>> R2. No incompatible change without opt-in. If you hardcode as in Allen's
>> example, don't opt in without changing your index, length, and char/code-at
>> assumptions.
>>
>> Such opt-in cannot be a pragma, since pragmas have lexical scope and affect
>> code, not the heap where strings and String.prototype methods live.
>>
>> We also wish to avoid exposing a "full Unicode" representation type and a
>> duplicated suite of the String static and prototype methods, as Java did.
>> (We may well want UTF-N transcoding helpers; we certainly want ByteArray <->
>> UTF-8 transcoding APIs.)
>>
>> True, R2 implies there are two string primitive representations at most, or
>> more likely "1.x" for some fraction .x. Say, a flag bit in the string header
>> to distinguish JS's uint16-based indexing ("UCS-2") from non-O(1)-indexing
>> UTF-16. Lots of non-observable implementation options here.
>>
>> Instead of any such *big* new observables, I propose a so-called "Big Red
>> [opt-in] Switch" (BRS) on the side of a unit of VM isolation: specifically
>> the global object.
>>
>> Why the global object? Because for many VMs, each global has its own heap or
>> sub-heap ("compartment"), and all references outside that heap are to local
>> proxies that copy from, or in the case of immutable data, reference the
>> remote heap. Also because inter-compartment traffic is (we conjecture)
>> infrequent enough to tolerate the proxy/copy overhead.
>>
>> For strings and String objects, such proxies would consult the remote heap's
>> BRS setting and transcode indexed access, and .length gets, accordingly. It
>> doesn't matter if the BRS is in the global or its String constructor or
>> String.prototype, as the latter are unforgeably linked to the global.
>>
>> This means a script intent on comparing strings from two globals with
>> different BRS settings could indeed tell that one discloses non-BMP
>> char/codes, e.g. charCodeAt return values >= 0x10000. This is the *small*
>> new observable I claim we can live with, because someone opted into it at
>> least in one of the related global objects.
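(To make that small observable concrete -- an illustrative sketch, not spec
text: suppose s holds the single character U+1F638 and was minted in a BRS-on
global, while t holds the same character as minted in a BRS-off global:)

// Full-Unicode (BRS on) view:
s.length == 1;
s.charCodeAt(0) == 0x1f638;    // >= 0x10000: the small new observable

// Today's (BRS off) view of the same character:
t.length == 2;
t.charCodeAt(0) == 0xd83d;
t.charCodeAt(1) == 0xde38;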
>>
>> Note that implementations such as Node.js can pre-set the BRS to "full
>> Unicode" at startup. Embeddings that fully isolate each global and its
>> reachable objects and strings pay no string-proxy or -copy overhead.
>>
>> R1. To keep compatibility with DOM APIs, the DOM glue used to mediate calls
>> from JS to (typically) C++ would have to proxy or copy any strings
>> containing non-BMP characters. Strings with only BMP characters would work
>> as today.
>>
>> Note that we are dealing only in spec observables here. It doesn't matter
>> whether the JS engine uses UTF-8 and the DOM UCS-2 (in which case there is
>> already a transcoding penalty; IIRC WebKit's libxml and libxslt use UTF-8
>> and so must transcode to interface with WebKit's DOM). The only issue at
>> this boundary, I believe, is how indexing and .length work.
>>
>> OK, there you have it: resolutions for both problems that killed the last
>> assault on Castle '90s-JS.
>>
>> Implementations that use uint16 vectors as the character data representation
>> type for both "UCS-2" and "UTF-16" string variants would probably want
>> another flag bit per string header indicating whether, for the UTF-16 case,
>> the string indeed contains any non-BMP characters. If not, no proxy/copy is
>> needed.
>>
>> Such implementations probably would benefit from string (primitive value)
>> proxies, not just copies, since the underlying uint16 vector could be shared
>> by two different string headers with whatever metadata flag bits, etc., are
>> needed to disclose different length values, access different methods from
>> distinct globals' String.prototype objects, etc.
>>
>> We could certainly also work with the W3C to revise the DOM to check the BRS
>> setting, if that is possible, to avoid this non-BMP-string proxy/copy
>> overhead.
>>
>> How is the BRS configured? Again, not via a pragma, and not by imperative
>> state update inside the language (mutating hidden BRS state at a given
>> program point could leave strings created before the mutation observably
>> different from those created after, unless the implementation in effect
>> scanned the local heap and wrapped or copied any non-BMP-char-bearing ones
>> created before).
>>
>> The obvious way to express the BRS in HTML is a <meta> tag in document
>> <head>, but I don't want to get hung up on this point. I do welcome expert
>> guidance. Here is another W3C/WHATWG interaction point. For this reason I'm
>> cc'ing public-script-coord.
>>
>> The upshot of this proposal is to get JS out of the '90s without a mandatory
>> breaking change. With simple-enough opt-in expressed at coarse-enough
>> boundaries so as not to impose high cost or unintended string type confusion
>> bugs, the complexity is mostly borne by implementors, and at less than a 2x
>> cost comparing string implementations (I think -- demonstration required of
>> course).
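(To make the "complexity is mostly borne by implementors" claim concrete -- an
illustrative sketch, not part of the proposal: code that uses only indexing
and .length, without hard-coded surrogate math as in Allen's example, keeps
its shape after the switch; it simply sees characters instead of uint16
storage units:)

// Count occurrences of the single-character string ch in str.
function countChar(str, ch) {
  var n = 0;
  for (var i = 0; i < str.length; i++) {
    if (str.charAt(i) === ch) n++;
  }
  return n;
}

countChar("a\u{1F638}b\u{1F638}c", "\u{1F638}") == 2;  // with the BRS on, each cat face is one element

With the BRS off, the same source would have to spell the cat face as a
surrogate pair, and the loop would visit each half separately.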
>>
>> In particular, Node.js can get modern at startup, and perhaps engines such
>> as V8 as used in Node could even support compile-time (#ifdef) configury by
>> which to support only full Unicode.
>>
>> Comments welcome.
>>
>> /be