Re: Full Unicode strings strawman

Allen Wirfs-Brock Mon, 16 May 2011 12:16:10 -0700

On May 16, 2011, at 11:30 AM, Mike Samuel wrote:

> 2011/5/16 Allen Wirfs-Brock <al...@wirfs-brock.com>:
>> I tried to post a pointer to this strawman on this list a few weeks ago, but
>> apparently it didn't reach the list for some reason.
>> Feedback would be appreciated:
>> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
> 
> Will this change the behavior of character groups in regular
> expressions?  Would myString.match(/^.$/)[0].length ever be
> 2?  Would it ever match a supplemental codepoint?
>


No, supplemental codepoints are single string characters and RegExp matching 
operates on such characters.  A string could, of course, contain character 
sequences that correspond to UTF-8, UTF-16, or other multi-unit encodings.  
However, from the perspective of Strings and RegExp, those encodings would be 
multi-character sequences, just like they are today.  The only ES functions 
currently proposed that would deal with multi-character encodings of 
supplemental codepoints are the URI handling functions.  However, it may be a 
good idea to add string-to-string UTF-8 and UTF-16 encode/decode functions that 
simply do the encode/decode and don't have all the other processing involved in 
encodeURI/decodeURI.
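
As a rough illustration of what a plain UTF-16 encode/decode pair might do, 
here is a minimal sketch.  The names utf16Encode and utf16Decode are 
hypothetical and do not come from the strawman.  Because current engines cannot 
hold a single string character above \uffff, the sketch uses arrays of numbers 
to stand in for strings; that is an assumption made only so the code runs today.

```javascript
// Sketch only: utf16Encode and utf16Decode are hypothetical names, not part
// of any ES specification or of the strawman. Arrays of numbers stand in for
// strings of full Unicode characters.

// Encode an array of code points (0..0x10FFFF) as UTF-16 code units.
function utf16Encode(codePoints) {
  var units = [];
  for (var i = 0; i < codePoints.length; i++) {
    var cp = codePoints[i];
    if (cp < 0 || cp > 0x10ffff) {
      throw new RangeError("not a Unicode code point: " + cp);
    }
    if (cp <= 0xffff) {
      units.push(cp);                         // BMP: one code unit
    } else {
      var offset = cp - 0x10000;              // 20-bit value
      units.push(0xd800 + (offset >> 10));    // high surrogate
      units.push(0xdc00 + (offset & 0x3ff));  // low surrogate
    }
  }
  return units;
}

// Decode UTF-16 code units back to code points.
// Unpaired surrogates are passed through unchanged.
function utf16Decode(units) {
  var codePoints = [];
  for (var i = 0; i < units.length; i++) {
    var hi = units[i];
    var lo = units[i + 1];  // undefined at the end, so the test below fails
    if (hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff) {
      codePoints.push(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
      i++;                  // consumed the low surrogate as well
    } else {
      codePoints.push(hi);
    }
  }
  return codePoints;
}
```

Unlike encodeURI/decodeURI, nothing here escapes or reserves any characters; 
the functions only convert between the two representations.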


> How would the below, which replaces orphaned surrogates with U+FFFD
> when strings are viewed as sequences of UTF-16 code units, behave?
> 
> myString.replace( /[\ud800-\udbff](?![\udc00-\uffff])/g, "\ufffd")
>    .replace( /(^|[^\ud800-\udbff])([\udc00-\udfff])/g, "\ufffd")

Exactly as it currently does, assuming it was applied to a string that didn't 
contain any codepoints greater than \uffff.  If the string contained any 
codepoints > \uffff, those characters would not match the pattern and so would 
not be replaced.
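
For reference, the behavior on strings of 16-bit code units can be checked with 
a small sketch.  The function name replaceOrphanedSurrogates is hypothetical.  
The sketch also makes two adjustments that are assumptions about the intent of 
the quoted pattern: both character classes for low surrogates are limited to 
\udc00-\udfff, and the second replacement uses "$1" so the character before an 
orphaned low surrogate is kept.

```javascript
// Sketch only: replaceOrphanedSurrogates is a hypothetical helper name.
// Treats the string as a sequence of UTF-16 code units.
function replaceOrphanedSurrogates(s) {
  return s
    // A high surrogate that is not followed by a low surrogate.
    .replace(/[\ud800-\udbff](?![\udc00-\udfff])/g, "\ufffd")
    // A low surrogate at the start of the string, or one whose preceding
    // code unit is not a high surrogate. "$1" restores that preceding unit.
    .replace(/(^|[^\ud800-\udbff])[\udc00-\udfff]/g, "$1\ufffd");
}
```

A well-formed pair is left alone, while a lone high or low surrogate is replaced 
with U+FFFD.  Two orphaned low surrogates in a row are not both caught, because 
the first match consumes the code unit the second one would need as context.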

The important thing to keep in mind here is that under this proposal, a 
supplemental codepoint is a single logical character.  For example, using a 
random character that isn't in the BMP:
"\u+02defc" === "\ud877\udefc";  // this is false
"\u+02defc".length === 1;  // this is true
"\ud877\udefc".length === 2;  // this is true

Existing code that manipulates surrogate pairs continues to work unmodified 
because such code is explicitly manipulating pairs of characters.  However, 
such code might produce unexpected results if handed a string containing a 
codepoint > \uffff.  But that takes an explicit action by someone to introduce 
such an enhanced character into a string.
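
As an example of the kind of pair-handling code meant here, the sketch below 
counts logical characters by explicitly pairing surrogates.  The function name 
countCodePoints is hypothetical, and the code only uses String.prototype 
methods that exist today (length and charCodeAt).

```javascript
// Sketch only: countCodePoints is a hypothetical helper name.
// Counts a high surrogate followed by a low surrogate as one character;
// every other code unit, including an unpaired surrogate, counts as one.
function countCodePoints(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++;  // skip the low surrogate of a well-formed pair
      }
    }
    count++;
  }
  return count;
}
```

On a string that encodes supplemental codepoints as surrogate pairs, this gives 
the same answer with or without the proposal, since the pairs are still two 
separate string characters.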



> 
> 
>> Allen
>> _______________________________________________
>> es-discuss mailing list
>> es-discuss@mozilla.org
>> https://mail.mozilla.org/listinfo/es-discuss
>> 
>> 

