On 17 May 2011 12:36, Boris Zbarsky <[email protected]> wrote:

>> Not quite: code points D800-DFFF are reserved code points which are not
>> representable with UTF-16.
>
> Nor with any other Unicode encoding, really.  They don't represent, on
> their own, Unicode characters.
Right - but they are still legitimate code points, and they fill out the
space required to let us treat String as uint16[] when defining the backing
store as "something that maps to the set of all Unicode code points".

That said, you can encode these code points with UTF-8; for example, 0xDC08
becomes 0xED 0xB0 0x88.

> No, you're allowing storage of some sort of number arrays that don't
> represent Unicode strings at all.

No - if I understand Allen's proposal correctly, we're allowing storage of
some sort of number arrays that may contain reserved code points, some of
which cannot be represented in UTF-16.

This isn't that different from the status quo; it is possible right now to
generate JS Strings which are not valid UTF-16 by creating unpaired
surrogates.

Keep in mind, also, that even an arbitrary sequence of code units is a valid
Unicode string. The standard does not require that it be well-formed (D80).

> Right, so if it's looking for non-BMP characters in the string, say,
> instead of computing the length, it won't find them.  How the heck is
> that "just works"?

My untested hypothesis is that the vast majority of JS code looking for
non-BMP characters is looking for them in order to call them out for special
processing, precisely because the code unit and code point sizes differ.
When they don't need special processing, they don't need to be found. Since
surrogate code points do not appear in well-formed Unicode strings, they
will not be found, and the unneeded special processing will not happen. That
chain of reasoning is the basis for my opinion that, for the majority of
folks, things will "just work".

> What would that even mean?  DOMString is defined to be an ES string in
> the ES binding right now.  Is the proposal to have some other kind of
> object for DOMString (so that, for example, String.prototype would no
> longer affect the behavior of DOMString the way it does now)?

Wait - are DOMStrings formally UTF-16, or are they ES Strings?
>> This might mean that it is possible that JSString=>DOMString would
>> throw, as full Unicode Strings could contain code points which are not
>> representable in UTF-16.
>
> How is that different from sticking non-UTF-16 into an ES string right
> now?

Currently, JS Strings are effectively arrays of 16-bit code units, which
are indistinguishable from 16-bit Unicode strings (D82). This means that a
JS application can use JS Strings as arrays of uint16, and expect to
round-trip all strings, even those which are not well-formed, through a
UTF-16 DOM.

If we redefine JS Strings to be arrays of Unicode code points, then a JS
application can use JS Strings as arrays of uint21 -- but round-tripping
the surrogate code points through a UTF-16 layer would not work.

>> It might mean extra copying, or it might not if the DOM implementation
>> already uses UTF-8 internally.
>
> Uh...  what does UTF-8 have to do with this?

If you're already storing UTF-8 strings internally, then you are already
doing something "expensive" (like copying) to get their code units into and
out of JS, so there would be no incremental performance impact from not
having a common UTF-16 backing store.

> (As a note, Gecko and WebKit both use UTF-16 internally; I would be
> _really_ surprised if Trident does not.  No idea about Presto.)

FWIW - the last time I scanned the V8 sources, it appeared to use a
three-representation string class, which could store ASCII, UCS-2, or
UTF-8. Presumably the ASCII representation could also hold ISO-Latin-1, as
both are exact, naive, byte-sized subsets of UCS-2/UTF-16.

Wes

--
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

