On Mon, May 16, 2011 at 2:19 PM, Mark Davis ☕ <m...@macchiato.com> wrote:
> I'm quite sympathetic to the goal, but the proposal does represent a
> significant breaking change. The problem, as Shawn points out, is with
> indexing. Before, the strings were defined as UTF16.

I agree with what Mark wrote, except that the previous spec used UCS-2, which
this proposal (and other proposals on the issue) tries to rectify. I think that
taking Java's approach would work better with DOMString as well. See the W3C
I18N WG's proposal
<http://www.w3.org/International/wiki/JavaScriptInternationalization> on the
issue (and Java's approach
<http://java.sun.com/developer/technicalArticles/Intl/Supplementary/>, linked
there).

Jungshik

>
> Take a sample string "\ud800\udc00\u0061" = "\u{10000}\u{61}". Right now,
> the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a'
> would be at offset 1. This will definitely cause breakage in existing code;
> characters are in different positions than they were, even characters that
> are not supplemental ones. All it takes is one supplemental character before
> the current position and the offsets will be off for the rest of the string.
>
> Faced with exactly the same problem, Java took a different approach that
> allows for handling of the full range of Unicode characters, but maintains
> backwards compatibility. It may be instructive to look at what they did
> (although there was definitely room for improvement in their approach!). I
> can follow up with that if people are interested. Alternatively, perhaps
> mechanisms can be put in place to tell ECMAScript to use new vs old indexing
> (Perl uses PRAGMAs for that kind of thing, for example), although that has
> its own ugliness.
>
> Mark
>
> *— Il meglio è l’inimico del bene —*
>
>
> On Mon, May 16, 2011 at 13:38, Wes Garland <w...@page.ca> wrote:
>
>> Allen;
>>
>> Thanks for putting this together.
>> We use Unicode data extensively in both
>> our web and server-side applications, and being forced to deal with UTF-16
>> surrogate pairs directly -- rather than letting the String implementation
>> deal with them -- is a constant source of mild pain. At first blush, this
>> proposal looks like it meets all my needs, and my gut tells me the perf
>> impacts will probably be neutral or good.
>>
>> Two great things about strings composed of Unicode code points:
>> 1) .length represents the number of code points, rather than the number of
>> UTF-16 code units, even if the underlying representation isn't UTF-16
>> 2) S.charCodeAt(S.indexOf(X)) always returns the same kind of information
>> (a Unicode code point), regardless of whether X is in the BMP or not
>>
>> Even though this is a breaking change from ES-5, I support it
>> whole-heartedly.... but I expect breakage to be very limited. Provided that
>> the implementation does not restrict the storage of reserved code points
>> (D800-DFFF), it should be possible for users using String as immutable
>> C-arrays to keep doing so. Users doing surrogate pair decomposition will
>> probably find that their code "just works", as those code points will never
>> appear in legitimate strings of Unicode code points. Users creating Strings
>> with surrogate pairs will need to re-tool, but this is a small burden and
>> these users will be at the upper strata of Unicode-foodom. I suspect that
>> 99.99% of users will find that this change will fix bugs in their code when
>> dealing with non-BMP characters.
>>
>> Mike Samuel, there would never be a supplementary code unit to match, as the
>> return value of [[Get]] would be a code point.
>>
>> Shawn Steele, I don't understand this comment:
>>
>>     Also, the “trick” I think, is encoding to surrogate pairs (illegally,
>>     since UTF8 doesn’t allow that) vs decoding to UTF16.
>>
>>
>> Why do we care about the UTF-16 representation of particular codepoints?
>> Why can't the new functions just encode the Unicode string as UTF-8 and URI
>> escape it?
>>
>> Mike Samuel, can you explain why you are en/decoding UTF-16 when
>> round-tripping through the DOM? Does the DOM specify UTF-16 encoding? If it
>> does, that's silly. Both ES and DOM should specify "Unicode" and let the
>> data interchange format be an implementation detail. It is an unfortunate
>> accident of history that UTF-16 surrogate pairs leak their abstraction into
>> ES Strings, and I believe it is high time we fixed that.
>>
>> Wes
>>
>> --
>> Wesley W. Garland
>> Director, Product Development
>> PageMail, Inc.
>> +1 613 542 2787 x 102
>>
>> _______________________________________________
>> es-discuss mailing list
>> es-discuss@mozilla.org
>> https://mail.mozilla.org/listinfo/es-discuss
>>
>
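
For reference, the offsets in Mark's sample string can be checked directly. The
following is only a rough sketch, assuming an engine that implements ES2015 or
later (Array.from over a string, for...of, and codePointAt did not exist in
ES5). The helper codeUnitOffset is hypothetical, named here for illustration and
loosely analogous to Java's String.offsetByCodePoints; it is not part of any
proposal discussed in this thread.

```javascript
// Sample string from the thread: U+10000 followed by 'a' (U+0061).
const s = "\ud800\udc00\u0061";

// Code unit view (current behaviour): the surrogate pair takes offsets 0 and 1.
const unitLength = s.length;
const unitOffsetOfA = s.indexOf("a");

// Code point view, built on the side without changing String indexing.
// Array.from iterates a string by code point, so the pair becomes one entry.
const codePoints = Array.from(s);
const pointLength = codePoints.length;
const pointOffsetOfA = codePoints.indexOf("a");

// Java-style access: the index is still a code unit offset, but the result
// is a whole code point when the index lands on a lead surrogate.
const atZero = s.codePointAt(0); // full supplementary code point
const atOne = s.codePointAt(1);  // lands mid-pair, yields the trail surrogate

// Hypothetical helper: map a code point offset to a code unit offset.
// Returns -1 when the requested offset is past the end of the string.
function codeUnitOffset(str, codePointOffset) {
  let units = 0;
  let points = 0;
  for (const ch of str) {
    if (points === codePointOffset) return units;
    units += ch.length; // 1 for BMP characters, 2 for supplementary ones
    points += 1;
  }
  return points === codePointOffset ? units : -1;
}
```

With this kind of side API, existing code that indexes by code unit keeps its
offsets, and only code that opts in sees code point positions, which is the
backwards-compatible trade-off Mark describes.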
_______________________________________________ es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss