On 5/16/11 7:05 PM, Allen Wirfs-Brock wrote:
If you allow storage of such, then you're allowing mixing Unicode strings and 
"something else" (whatever the something else is), with most likely bad 
results.

Most simply, assigning a DOMString containing surrogates to a JS string should 
collapse the surrogate pairs into the corresponding codepoint if JS strings 
really contain codepoints...

No, that would be a breaking change to the web!

OK, we agree there.

The only way to make this work is if either DOMString is redefined or DOMString 
and full Unicode strings are different kinds of objects.

Not really, you need to make the distinction between what a String can contain 
and what String contents are valid in specific application domains.

DOMString seems to be quite clearly defined to consist of 16-bit valued 
elements interpreted as a UTF-16 encoded Unicode string.

Right, because that's all it can be given what ES string is right now.

All such DOMStrings are valid ES strings according to my proposal, but it isn't 
the case that all ES Strings are valid DOMStrings.

OK, what happens when such an ES string is passed to an interface that takes a DOMString?

What happens when such an ES string is concatenated with a DOMString containing surrogate pairs? I don't mean on the implementation level for the concatenation case; that part is trivial. I mean what can the programmer sanely do with the result?

To the depth of my understanding, this is already the case today with 
16-bit ES characters.  You can create an ES string which does not conform to the 
UTF-16 encoding rules.

In practice browsers just let you put non-UTF16 stuff in the DOM and then fake things, as far as I can tell. There are a few bugs around on not allowing that, but the cost is too high (e.g. there are popular browser implementations in which the DOM and JS share the same reference counted string buffer when you pass a string across the boundary).
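To make the "non-UTF16 stuff" point concrete, here is a minimal sketch in current JS. The helper name isWellFormedUtf16 is made up for illustration and is not an API from the thread; it only checks that every surrogate code unit sits in a correctly ordered high/low pair.

```javascript
// Hypothetical helper: returns true if every surrogate code unit in `s`
// is part of a correctly ordered high/low pair (well-formed UTF-16).
function isWellFormedUtf16(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low half of the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

const paired = "\uD835\uDD38";  // U+1D538 as a surrogate pair
const lone = "\uD835";          // high surrogate on its own
const swapped = "\uDD38\uD835"; // halves in the wrong order

console.log(isWellFormedUtf16(paired));  // true
console.log(isWellFormedUtf16(lone));    // false
console.log(isWellFormedUtf16(swapped)); // false
```

All three values are perfectly legal ES strings today; only the first is valid UTF-16.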

Users doing surrogate pair decomposition will probably find that their code "just 
works"

How, exactly?

Because the string will continue to contain surrogate pairs.

Until someone somewhere else in the workflow (say an ad on the page, in the browser context) adds in a string containing non-BMP codepoints, right?
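A rough model of that mixing scenario, under loud assumptions: today's JS strings cannot hold elements above 0xFFFF, so the sketch below uses plain arrays of numbers to stand in for string elements. The names asPair, asSingle, sameElements and countCharacters are invented for illustration and say nothing about how the proposal would actually be specified.

```javascript
// Model only: arrays of numbers stand in for string elements.
const asPair = [0xD835, 0xDD38]; // U+1D538 stored as a surrogate pair
const asSingle = [0x1D538];      // U+1D538 stored as one 21-bit element

// Element-wise comparison, which is what string equality amounts to.
function sameElements(a, b) {
  return a.length === b.length && a.every((x, i) => x === b[i]);
}

// Typical existing logic: count characters by stepping over surrogate pairs.
function countCharacters(elements) {
  let count = 0;
  for (let i = 0; i < elements.length; i++) {
    const cur = elements[i];
    const next = elements[i + 1]; // undefined past the end
    const isPair =
      cur >= 0xD800 && cur <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF;
    if (isPair) i++;
    count++;
  }
  return count;
}

// Concatenating the two forms of the same character.
const mixed = [...asPair, ...asSingle];

console.log(sameElements(asPair, asSingle)); // false
console.log(mixed.length);                   // 3
console.log(countCharacters(mixed));         // 2
```

In this model the same character appears twice in the result but in two different shapes, so equality, search and length-based logic written for one shape can give surprising answers on the other.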

Users creating Strings with surrogate pairs will need to
re-tool

Such users would include the DOM, right?

No. That would be a breaking change in the context of the browser.

OK...

Programs creating surrogate pairs that want to be updated to not use surrogate 
pairs are the only ones that need to retool.

I think this is compartmentalizing programs in ways that don't map to reality.

More likely we are talking about new code that can be written without having to 
worry about surrogate pairs.

And again here. A lot of JS on the web takes strings from all sorts of sources not necessarily under the control of the JS itself (user input, XMLHttpRequest of XML and JSON, etc), then mashes them together in various ways.

If somebody wants to grab a bunch of text from the DOM and manipulate it 
without encountering surrogate pairs, they will need to explicitly perform a 
decodeUTF16 transformation.

What if they don't want to encounter non-BMP characters except in surrogate pair form (i.e. have the environment they have now)?
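For reference, a sketch of what such a transformation might compute, written against today's 16-bit strings. The decodeUTF16 name comes from the message above, but its signature here is an assumption: since a current JS string cannot hold 21-bit elements, this version returns an array of code points, and encodeUTF16 is an invented counterpart that goes back the other way.

```javascript
// Assumed shape of decodeUTF16: string of 16-bit units in, array of
// code points out. Not a real or proposed API signature.
function decodeUTF16(s) {
  const codePoints = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    const lo = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      // Collapse a surrogate pair into the code point it encodes.
      codePoints.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
      i++;
    } else {
      // BMP unit, or an unpaired surrogate passed through unchanged.
      codePoints.push(hi);
    }
  }
  return codePoints;
}

// Hypothetical inverse: code points back to a string of 16-bit units.
function encodeUTF16(codePoints) {
  let s = "";
  for (const cp of codePoints) {
    if (cp > 0xFFFF) {
      const offset = cp - 0x10000;
      s += String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
    } else {
      s += String.fromCharCode(cp);
    }
  }
  return s;
}

const text = "a\uD835\uDD38"; // "a" followed by U+1D538
const decoded = decodeUTF16(text);
console.log(decoded.map((cp) => cp.toString(16))); // [ '61', '1d538' ]
console.log(encodeUTF16(decoded) === text);        // true
```

Code that wants to keep seeing surrogate pairs would simply never call the decode step, which is the environment being asked about.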

You're talking every single web developer here.  Or at least every single web 
developer who wants to work with Devanagari text.

No, they will probably always have a choice for their own internal processing.  
Deal with logically 16-bit characters that use UTF-16.  Or deal with logical 
21-bit characters.  Only when communicating with an external agent (for example 
the DOM) do you have to adapt to that agent's requirements.

Web JS is _all_ about communicating with external agents. That's its purpose in life, for the most part.

Somebody has to go first.  I'm saying that it has to be ES that goes first. ES 
can do this without breaking any existing web code.

I disagree on "somebody has to go first"; it should be possible to coordinate such a change.

I agree that if we impose an ordering then clearly ES has to go first.

I think that if we made this change to ES only today and the new capabilities were completely unused no existing web code would break.

I also think that if we made this change to ES only today and then part but not all of the web got changed to use the new capabilities we would break some web code.

I will posit as an axiom that any changes to the web in terms of adopting the new feature will be incremental (please let me know if there is reason to think this is not the case). A corollary that I believe to be true is that we therefore have to assume that "old strings" and "new strings" will coexist in the set of strings scripts have to handle as things stand. This may be true no matter what the DOM does, but is _definitely_ true if the DOM remains as it is.

-Boris
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss