Re: Full Unicode strings strawman

Boris Zbarsky Mon, 16 May 2011 18:57:07 -0700

On 5/16/11 7:05 PM, Allen Wirfs-Brock wrote:

If you allow storage of such, then you're allowing mixing Unicode strings and 
"something else" (whatever the something else is), most likely with bad 
results.


Most simply, assigning a DOMString containing surrogates to a JS string should 
collapse the surrogate pairs into the corresponding codepoint if JS strings 
really contain codepoints...


No, that would be a breaking change to the web!


OK, we agree there.

The only way to make this work is if either DOMString is redefined or DOMString 
and full Unicode strings are different kinds of objects.


Not really, you need to make the distinction between what a String can contain 
and what String contents are valid in specific application domains.

DOMString seems to be quite clearly defined to consist of 16-bit valued 
elements interpreted as a UTF-16 encoded Unicode string.


Right, because that's all it can be given what ES string is right now.

All such DOMStrings are valid ES strings according to my proposal, but it isn't 
the case that all ES Strings are valid DOMStrings.

OK, what happens when such an ES string is passed to an interface that takes a DOMString?

What happens when such an ES string is concatenated with a DOMString containing surrogate pairs? I don't mean on the implementation level for the concatenation case; that part is trivial. I mean what can the programmer sanely do with the result?
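
To make the concatenation question concrete, here is a minimal sketch. Since today's JS strings cannot hold a non-BMP value in a single element, it models strings as arrays of element values. The names `domString`, `fullUnicode`, and `countCharacters` are illustrative assumptions and not part of the strawman.

```javascript
// Model: each array entry is one string element.
// U+1D11E (MUSICAL SYMBOL G CLEF) in both representations.
const domString = [0xD834, 0xDD1E]; // UTF-16 surrogate pair: 2 elements
const fullUnicode = [0x1D11E];      // hypothetical full-Unicode string: 1 element

// Concatenation itself is trivial...
const mixed = domString.concat(fullUnicode);

// ...but counting characters in the result has to handle both forms at once.
function countCharacters(elements) {
  let count = 0;
  for (let i = 0; i < elements.length; i++) {
    const e = elements[i];
    const next = elements[i + 1]; // undefined past the end
    const isHigh = e >= 0xD800 && e <= 0xDBFF;
    const nextIsLow = next >= 0xDC00 && next <= 0xDFFF;
    if (isHigh && nextIsLow) i++; // skip the low half of a surrogate pair
    count++;
  }
  return count;
}
```

In this model `mixed` has three elements but holds two characters, so code that assumes either "one element per character" or "all non-BMP characters are pairs" would get the wrong answer.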

To the depth of my understanding, I think that this is already the case today with 
16-bit ES characters.  You can create an ES string which does not conform to the 
UTF-16 encoding rules.

In practice browsers just let you put non-UTF16 stuff in the DOM and then fake things, as far as I can tell. There are a few bugs around on not allowing that, but the cost is too high (e.g. there are popular browser implementations in which the DOM and JS share the same reference counted string buffer when you pass a string across the boundary).
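
The point that a JS string need not be valid UTF-16 can be checked directly. The sketch below builds a string holding a lone surrogate and tests it with a hand-written check; `isWellFormedUTF16` is a hypothetical helper name chosen for this illustration.

```javascript
// A lone high surrogate: a legal JS string, but not valid UTF-16.
const lone = "\uD834";
// The same high surrogate followed by its low surrogate: U+1D11E.
const paired = "\uD834\uDD1E";

function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = s.charCodeAt(i + 1); // NaN past the end of the string
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return false;
      i++; // consume the low surrogate
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return false; // low surrogate with no preceding high surrogate
    }
  }
  return true;
}
```

Here `isWellFormedUTF16(lone)` is false and `isWellFormedUTF16(paired)` is true, even though both are ordinary strings as far as the language is concerned.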

Users doing surrogate pair decomposition will probably find that their code "just 
works"


How, exactly?


Because the string will continue to contain surrogate pairs.

Until someone somewhere else in the workflow (say an ad on the page, in the browser context) adds in a string containing non-BMP codepoints, right?

Users creating Strings with surrogate pairs will need to
re-tool


Such users would include the DOM, right?


No. That would be a breaking change in the context of the browser.


OK...

Programs creating surrogate pairs that want to be updated to not use surrogate pairs 
are the only ones that need to retool.

I think this is compartmentalizing programs in ways that don't map to reality.

More likely we are talking about new code that can be written without having to 
worry about surrogate pairs.

And again here. A lot of JS on the web takes strings from all sorts of sources not necessarily under the control of the JS itself (user input, XMLHttpRequest of XML and JSON, etc), then mashes them together in various ways.

If somebody wants to grab a bunch of text from the DOM and manipulate it 
without encountering surrogate pairs, they will need to explicitly perform a 
decodeUTF16 transformation.
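
As a rough illustration of what such a transformation might do: the thread names `decodeUTF16` but does not give a signature, so everything below is an assumption. Because current JS strings cannot hold 21-bit elements, this sketch returns an array of code points, and it passes lone surrogates through unchanged.

```javascript
// Hypothetical sketch: turn a string of 16-bit elements into an array of
// code points, collapsing each valid surrogate pair into a single value.
function decodeUTF16(s) {
  const codePoints = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const next = s.charCodeAt(i + 1); // NaN past the end of the string
    const isPair =
      unit >= 0xD800 && unit <= 0xDBFF &&
      next >= 0xDC00 && next <= 0xDFFF;
    if (isPair) {
      codePoints.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
      i++; // consume the low surrogate
    } else {
      codePoints.push(unit); // BMP value or lone surrogate, kept as-is
    }
  }
  return codePoints;
}
```

For example, `decodeUTF16("a\uD834\uDD1E")` yields the two values `0x61` and `0x1D11E`.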

What if they don't want to encounter non-BMP characters except in surrogate pair form (i.e. have the environment they have now)?

You're talking every single web developer here.  Or at least every single web 
developer who wants to work with Devanagari text.


No, they will probably always have a choice for their own internal processing.  
Deal with logical 16-bit characters that use UTF-16.  Or deal with logical 
21-bit characters.  Only when communicating with an external agent (for example 
the DOM) do you have to adapt to that agent's requirements.

Web JS is _all_ about communicating with external agents. That's its purpose in life, for the most part.
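
For a sense of what adapting at that boundary involves, here is a minimal sketch of the conversion a script would need before handing 21-bit values to a UTF-16 consumer such as the DOM. The function name `encodeUTF16` and its array-of-code-points input are assumptions made for this example.

```javascript
// Hypothetical sketch: turn an array of code points into a string of
// 16-bit elements, splitting each non-BMP value into a surrogate pair.
function encodeUTF16(codePoints) {
  let out = "";
  for (const cp of codePoints) {
    if (cp > 0xFFFF) {
      const offset = cp - 0x10000;
      out += String.fromCharCode(
        0xD800 + (offset >> 10),  // high surrogate: top 10 bits
        0xDC00 + (offset & 0x3FF) // low surrogate: bottom 10 bits
      );
    } else {
      out += String.fromCharCode(cp);
    }
  }
  return out;
}
```

With this, `encodeUTF16([0x61, 0x1D11E])` produces `"a\uD834\uDD1E"`. Every call site that passes text to an external agent would need such a step, which is the cost being argued about here.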

Somebody has to go first.  I'm saying that it has to be ES that goes first. ES 
can do this without breaking any existing web code.

I disagree on "somebody has to go first"; it should be possible to coordinate such a change.


I agree that if we impose an ordering then clearly ES has to go first.

I think that if we made this change to ES only today and the new capabilities were completely unused no existing web code would break.

I also think that if we made this change to ES only today and then part but not all of the web got changed to use the new capabilities we would break some web code.

I will posit as an axiom that any changes to the web in terms of adopting the new feature will be incremental (please let me know if there is reason to think this is not the case). A corollary that I believe to be true is that we therefore have to assume that "old strings" and "new strings" will coexist in the set of strings scripts have to handle as things stand. This may be true no matter what the DOM does, but is _definitely_ true if the DOM remains as it is.


-Boris
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
