On May 19, 2011, at 2:06 PM, Shawn Steele wrote:

> There are several sequences in Unicode which are meaningless if you have only 
> one character and not the other.  Eg: any of the variation selectors by 
> themselves are meaningless.  So if you break a modified character from its 
> variation selector you've damaged the string.  That's pretty much identical 
> to splitting a high surrogate from its low surrogate.  
> ...
> Things that UTF-32 works for without special cases:
> * Ordinal collation/sorting (eg: non-linguistic (so why is it a string?))

This is exactly my point.  The string data type in ECMAScript is 
non-linguistic.  There is nothing Unicode-specific about the fundamental 
ECMAScript string data type or about any of the language operations 
(concatenation, comparison, length determination, character access) upon 
strings. Similarly, the majority of String methods have no specific Unicode 
semantic dependencies (the exceptions are the toUpperCase/toLowerCase methods, 
and they don't treat surrogate pairs as a unit).  The string data type can be 
used for many purposes that have nothing to do with the linguistic semantics of 
Unicode. That is why linguistically based arguments seem to be missing the 
point.
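
To make that concrete, here is a minimal sketch of how the basic operations 
treat a string as a sequence of 16-bit values rather than as Unicode text. The 
variable names and the choice of U+1D11E as the sample supplementary character 
are mine, purely for illustration.

```javascript
// U+1D11E (MUSICAL SYMBOL G CLEF) written as its UTF-16 surrogate pair.
var clef = "\uD834\uDD1E";

// Length determination counts 16-bit character codes, not characters.
var length = clef.length;            // 2

// Character access returns the individual 16-bit values.
var first = clef.charCodeAt(0);      // 0xD834 (high surrogate)
var second = clef.charCodeAt(1);     // 0xDD1E (low surrogate)

// Splitting the pair is allowed; each half is an ordinary string value,
// and concatenation simply puts the 16-bit values back side by side.
var rejoined = clef.charAt(0) + clef.charAt(1);
var roundTrips = (rejoined === clef);

// Comparison is ordinal over the 16-bit values: 0xFFFF > 0xD834, so the BMP
// character U+FFFF sorts after the supplementary character U+1D11E.
var bmpSortsAfter = ("\uFFFF" > clef);
```

None of these operations consults any Unicode property of the values involved.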

The potential connection between Unicode semantics and the string data type 
lies in the interpretation of ECMAScript string literals as constructors of 
string values. ECMAScript is biased towards Unicode in the sense that it only 
supports a Unicode interpretation of string literals. However, ECMAScript 
literals can currently contain only BMP characters and escape sequences that 
produce BMP code points, and these are directly represented in string values 
as 16-bit character codes.  Given that level of Unicode bias in the language, 
there is obvious utility in allowing literals to contain any Unicode character 
and in supporting the generation of string values that use Unicode UTF-16 (and 
possibly, alternatively, UTF-8) encoding semantics. The utility of such 
features seems independent of the underlying size of the string type's 
character codes.
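
As a rough illustration of generating a string value with UTF-16 encoding 
semantics, the sketch below builds the 16-bit character codes for an arbitrary 
code point. The helper name utf16Encode is hypothetical, not an existing or 
proposed API; it relies only on String.fromCharCode and the standard UTF-16 
surrogate arithmetic.

```javascript
// Hypothetical helper (not part of the language): returns a string whose
// 16-bit character codes are the UTF-16 encoding of the given code point.
function utf16Encode(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("Not a Unicode code point: " + codePoint);
  }
  if (codePoint <= 0xFFFF) {
    // BMP code points are represented directly as one 16-bit value.
    return String.fromCharCode(codePoint);
  }
  // Supplementary code points are split into a 20-bit offset whose upper
  // and lower 10 bits select the high and low surrogates.
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);
  var low = 0xDC00 + (offset & 0x3FF);
  return String.fromCharCode(high, low);
}

// U+1D11E encodes to the same value as the literal "\uD834\uDD1E".
var encoded = utf16Encode(0x1D11E);
var matchesLiteral = (encoded === "\uD834\uDD1E");
```

A literal syntax for supplementary characters would amount to performing this 
same computation at the point where the literal is interpreted.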


es-discuss mailing list
