Phillips, Addison wrote:
> Why would converting the existing UCS-2 support to be UTF-16 not be a good idea? There is nothing
> intrinsically wrong that I can see with that approach and it would be the most compatible with
> existing scripts, with no special "modes", "flags", or interactions.

Allen essentially proposed this last year (some confusion surrounded the discussion, from mixing what is observable in the language with encoding/format/serialization issues, which led to talk of 32-bit characters). As I wrote in the o.p., this met two objections: a big implementation hit, and an incompatible change.

I tackled the second with the BRS (Big Red Switch) and, in detail, mediation across DOM window boundaries. I believe this also takes the sting out of the first objection: the implementation change is smaller in light of the mediation that already exists at those boundaries.

> Yes, the complexity of supplementary characters (i.e. non-BMP characters)
> represented as surrogate pairs must still be dealt with.

I'm not sure what you mean. JS today allows such surrogates (ignoring invalid pairs), but a pair counts as two indexes and adds two to length, not one. That is the first problem to fix (setting aside literal escape-notation expressiveness).
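For instance (a quick sketch; U+1F4A9 is just an arbitrary supplementary character):

```javascript
// U+1F4A9 is a supplementary (non-BMP) character, stored as the
// surrogate pair 0xD83D 0xDCA9 in today's uint16-based strings.
var s = "\uD83D\uDCA9";
s.length;        // 2 -- two uint16 units, not one character
s.charCodeAt(0); // 0xD83D (high surrogate)
s.charCodeAt(1); // 0xDCA9 (low surrogate)
s.charAt(0);     // "\uD83D" -- half a character
```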

> It would also expose the possibility of invalid strings (with unpaired
> surrogates).

That problem exists today.
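Indeed, nothing in the language today stops you from building an invalid (unpaired-surrogate) string:

```javascript
// A lone high surrogate is a perfectly legal ES string value today:
var lone = "\uD800";
lone.length; // 1

// Slicing can also split a valid pair into two invalid strings:
var pair = "\uD83D\uDCA9";
var head = pair.slice(0, 1); // "\uD83D" -- unpaired high surrogate
var tail = pair.slice(1);    // "\uDCA9" -- unpaired low surrogate
```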

> But this would not be unlike other programming languages--or even ES as it
> exists today.

Right! We should do better. As I noted, Node.js heavy hitters (mranney of Voxer) testify that they want full Unicode, not what's specified today with indexing and length-accounting by uint16 storage units.
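To make the difference concrete, here is a sketch of code-point-based accounting -- the kind of length "full Unicode" strings would report (the function name is illustrative, not a proposed API):

```javascript
// Count code points rather than uint16 storage units, treating a
// valid high/low surrogate pair as a single character.
function codePointLength(s) {
  var n = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
      var d = s.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF)
        i++; // skip the low half of the pair
    }
    n++;
  }
  return n;
}

codePointLength("abc");          // 3, same as "abc".length
codePointLength("\uD83D\uDCA9"); // 1, where .length reports 2
```

An unpaired surrogate still counts as one (malformed) element under this accounting, which is one of the design questions a real proposal would have to settle.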

> The purity of a "Unicode string" would be watered down, but perhaps not
> fatally. The Java language went through this (yeah, I know, I know...) and seems to have
> emerged unscathed.

Java's dead on the client. It is used by botnets: bugzilla.mozilla.org recently suffered a DDOS from one, and the bad guys didn't even bother changing the user-agent from the Java runtime's default. See Brian Krebs' blog.

> Norbert has a lovely doc here about the choices that led to this, which
> seems useful to consider: [1]. W3C I18N Core WG has a wiki page shared with
> TC39 a while ago here: [2].

> To me, switching to UTF-16 seems like a relatively small, containable,
> non-destructive change to allow supplementary character support.

I still don't know what you mean. How would what you call "switching to UTF-16" differ from today, where one can inject surrogates into literals by transcoding from an HTML document or .js file CSE?

In particular, what do string indexing and .length count, uint16 units or characters?

/be
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
