RE: Full Unicode strings strawman

Phillips, Addison Tue, 17 May 2011 12:01:24 -0700

Note: The W3C Internationalization Core WG published a set of "requirements" in 
this area for consideration by ES some time ago. It lives here:


   http://www.w3.org/International/wiki/JavaScriptInternationalization 

The section on 'locale related behavior' is being separately addressed.

I think that:

1. Changing references from UCS-2 to UTF-16 makes sense, although the spec, 
IIRC, already *says* UTF-16.
2. Allowing unpaired surrogates is a *requirement*. Yes, such a string is 
"ill-formed", but there are too many cases in which one might wish to have such 
"broken" strings for scripting purposes.
3. We should have escape syntax for supplementary characters (such as 
\U00010000). Looking up the surrogate pair for a given Unicode character is 
extremely inconvenient and is not self-documenting.
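
To illustrate the inconvenience in point 3, here is a minimal sketch of the arithmetic a script author has to do by hand to find the surrogate pair for a supplementary code point. The helper names toSurrogatePair and toEscape are hypothetical and are not part of any proposal.

```javascript
// Hypothetical helper: compute the UTF-16 surrogate pair for a
// supplementary code point (U+10000 through U+10FFFF).
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point: " + codePoint);
  }
  var offset = codePoint - 0x10000;    // 20-bit value
  var high = 0xD800 + (offset >> 10);  // top 10 bits select the high surrogate
  var low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits select the low surrogate
  return [high, low];
}

// Hypothetical helper: render the pair as the escapes one must type today.
function toEscape(codePoint) {
  var pair = toSurrogatePair(codePoint);
  return "\\u" + pair[0].toString(16) + "\\u" + pair[1].toString(16);
}

console.log(toEscape(0x10000)); // \ud800\udc00
```

Nothing in the resulting escape tells a reader which character was meant, which is the "not self-documenting" problem.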

As Shawn notes, basically, there are three ways that one might wish to access 
strings:

- as grapheme clusters (visual units of text)
- as Unicode scalar values (logical units of text, i.e. characters)
- as code units (encoding units of text)

The example I use in the Unicode conference internationalization tutorial is a 
box on a Web site with an ES-controlled message underneath it saying "You have 
200 characters remaining."

I think it is instructive to look at how Java managed this transition. In some 
cases the "200" represents the number of storage units I have available (as in 
my backing database), in which case String.length is what I probably want. In 
some cases I want to know how many Unicode characters there are (Java solves 
this with the codePointCount(), codePointBefore(), and codePointAt() methods). 
These are relatively rare operations, but they have occasional utility. Or I 
may want grapheme clusters. Java attempts to solve this with BreakIterators, 
and I tend to favor doing the same thing in JavaScript: default grapheme 
clusters are better than nothing, but language-specific grapheme clusters are 
more useful.
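
As a rough sketch of the first two counts, the following compares code units with code points in plain JavaScript. The function names are hypothetical (loosely modeled on Java's codePointCount()), and the sketch assumes that an unpaired surrogate counts as one code point, in line with the requirement that such strings remain legal. Grapheme clusters are not attempted here because they need segmentation data.

```javascript
// Code units: this is what String.length already reports.
function countCodeUnits(s) {
  return s.length;
}

// Code points: a well-formed surrogate pair counts once; any other
// code unit, including an unpaired surrogate, counts once on its own.
function countCodePoints(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate that completes the pair
      }
    }
    count++;
  }
  return count;
}

var sample = "a\ud800\udc00b";        // "a", U+10000, "b"
console.log(countCodeUnits(sample));  // 4
console.log(countCodePoints(sample)); // 3
```

For the "200 characters remaining" box, the first number matches a storage limit in code units, while the second is closer to what a user would call characters. Neither treats A followed by U+0308 as a single unit.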

If we follow the above, providing only minimal additional methods for accessing 
codepoints when necessary, this also limits the impact of adding supplementary 
character support to the language. Regex probably works the way one supposes 
(both \U00010000 and \ud800\udc00 find the surrogate pair \ud800\udc00, and one 
can still find the low surrogate \udc00 if one wishes to). And existing 
scripts will continue to function without alteration. However, new scripts can 
be written that use supplementary characters. 
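
The behavior described above can be sketched as follows. The \U notation in this message is only a proposal, so the sketch instead assumes the \u{...} escape and the "u" regex flag that ECMAScript 2015 later standardized. The variable names are illustrative only.

```javascript
var text = "x\ud800\udc00y"; // U+10000 stored as a surrogate pair

// Existing code-unit behavior is unchanged.
var unitLength = text.length;                // 4
var pairIndex = text.search(/\ud800\udc00/); // 1
var lowIndex = text.search(/\udc00/);        // 2: the low surrogate is still findable

// New scripts can name the supplementary character directly.
var codePointIndex = text.search(/\u{10000}/u); // 1

// With the "u" flag the pair is matched as one unit, so a lone
// low-surrogate pattern does not match inside it.
var lowInsidePair = /\udc00/u.test(text); // false

console.log(unitLength, pairIndex, lowIndex, codePointIndex, lowInsidePair);
```

Scripts that never opt in to the code point syntax keep seeing the same code units and the same indexes as before.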

Regards,

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: Shawn Steele [mailto:shawn.ste...@microsoft.com]
> Sent: Tuesday, May 17, 2011 11:09 AM
> To: Brendan Eich; Boris Zbarsky
> Cc: es-discuss
> Subject: RE: Full Unicode strings strawman
> 
> I would much prefer changing "UCS-2" to "UTF-16", thus formalizing that
> surrogate pairs are permitted.  That would be very unlikely to break any
> existing code and would still allow representation of everything reasonable
> in Unicode.
> 
> That would enable Unicode, and allow extending string literals and regular
> expressions for convenience with the U+10FFFF style notation (which would be
> equivalent to the surrogate pair).  The character code manipulation functions
> could be similarly augmented without breaking anything (and maybe not
> needing different names?)
> 
> You might want to qualify the UTF-16 as allowing, but strongly discouraging,
> lone surrogates for those people who didn't realize their binary data wasn't a
> string.
> 
> The sole disadvantage would be that iterating through a string would require
> consideration of surrogates, same as today.  The same caution is also
> necessary to avoid splitting Ä (U+0041 U+0308) into its component A
> and   ̈ parts.  I
> wouldn't be opposed to some sort of helper functions or classes that aided in
> walking strings, preferably with options to walk the graphemes (or whatever),
> not just the surrogate pairs.  FWIW: we have such a helper for surrogates
> in .Net and "nobody uses them".  The most common feedback is that it's not
> that helpful because it doesn't deal with the graphemes.
> 
> - Shawn
> 
> shawn.ste...@microsoft.com
> Senior Software Design Engineer
> Microsoft Windows

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
