Re: Full Unicode strings strawman

Boris Zbarsky Mon, 16 May 2011 14:42:59 -0700

On 5/16/11 4:38 PM, Wes Garland wrote:

Two great things about strings composed of Unicode code points:

...

If though this is a breaking change from ES-5, I support it
whole-heartedly.... but I expect breakage to be very limited. Provided
that the implementation does not restrict the storage of reserved code
points (D800-DF00)


Those aren't code points at all.  They're just not Unicode.

If you allow storage of such, then you're allowing mixing Unicode strings and "something else" (whatever the something else is), most likely with bad results.

Most simply, assigning a DOMString containing surrogates to a JS string should collapse the surrogate pairs into the corresponding codepoint if JS strings really contain codepoints...
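
To make "collapse the surrogate pairs" concrete, here is a rough sketch of the conversion such an assignment would imply, assuming the DOMString arrives as a sequence of UTF-16 code units. The name collapseSurrogates is made up for illustration and is not part of any spec or of the strawman; how to treat a lone surrogate is exactly the open question, so the sketch simply passes it through.

```javascript
// Hypothetical helper (not from any spec): turn a string of UTF-16 code
// units into an array of code points, collapsing each valid surrogate pair.
function collapseSurrogates(s) {
  var out = [];
  for (var i = 0; i < s.length; i++) {
    var hi = s.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
      var lo = s.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Standard UTF-16 decoding of a high/low surrogate pair.
        out.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
        i++; // skip the low surrogate we just consumed
        continue;
      }
    }
    // BMP code unit, or a lone surrogate passed through unchanged
    // (an assumption; a real design would have to decide what to do here).
    out.push(hi);
  }
  return out;
}
```

For U+1D11E (stored as the code units D834 DD1E) this yields a single entry, while an unpaired D834 comes out as D834, which is the "something else" case above.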

The only way to make this work is if either DOMString is redefined or DOMString and full Unicode strings are different kinds of objects.

Users doing surrogate pair decomposition will probably find that their code "just 
works"


How, exactly?

Users creating Strings with surrogate pairs will need to
re-tool


Such users would include the DOM, right?

but this is a small burden and these users will be at the upper
strata of Unicode-foodom.

You're talking every single web developer here. Or at least every single web developer who wants to work with Devanagari text.

I suspect that 99.99% of users will find that
this change will fix bugs in their code when dealing with non-BMP
characters.

Not unless DOMString is changed or the interaction between the two is very carefully defined in failure-proof ways.

Why do we care about the UTF-16 representation of particular
codepoints?

Because of DOMString's use of UTF-16, at least (forced on it by the fact that that's what ES used to do, but here we are).
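
As a small illustration of how the UTF-16 representation shows through to script today, the sketch below builds a non-BMP character from its code units. The helper name toSurrogatePair is hypothetical; the arithmetic is the standard UTF-16 encoding, and only String.fromCharCode and charCodeAt are assumed from the language.

```javascript
// Hypothetical helper: encode one code point as UTF-16 code units.
function toSurrogatePair(cp) {
  if (cp < 0x10000) {
    return [cp]; // BMP: a single code unit
  }
  var offset = cp - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// U+1D11E (MUSICAL SYMBOL G CLEF) as script sees it in a UTF-16 string.
var clef = String.fromCharCode.apply(null, toSurrogatePair(0x1D11E));
var clefLength = clef.length;          // 2 code units, not 1 character
var firstUnit = clef.charCodeAt(0);    // 0xD834, the high surrogate
```

Any script that indexes or measures such a string is observing code units, which is why the representation matters on both sides of the DOM boundary.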

Mike Samuel, can you explain why you are en/decoding UTF-16 when
round-tripping through the DOM?  Does the DOM specify UTF-16 encoding?


Yes.

If it does, that's silly.

It needed to specify _something_, and UTF-16 was the thing that was compatible with how scripts work in ES. Not to mention the Java legacy of the DOM...

Both ES and DOM should specify "Unicode" and let the data interchange format be 
an implementation detail.

That's fine if _both_ are changed. Changing just one without the other would just cause problems.

It is an unfortunate accident of history that UTF-16 surrogate pairs leak their
abstraction into ES Strings, and I believe it is high time we fixed that.

If you can do that without breaking web pages, great. If not, then we need to talk. ;)


-Boris
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
