I have read the discussion so far, but would like to come back to the
strawman itself, because I believe it starts from a problem statement
that is incorrect and that misleads the discussion. Correctly
describing the current situation would help in evaluating possible
changes, in particular their compatibility impact.
The relevant portion of the problem statement:
"ECMAScript currently only directly supports the 16-bit basic
multilingual plane (BMP) subset of Unicode which is all that existed
when ECMAScript was first designed. [...] As currently defined,
characters in this expanded character set cannot be used in the source
code of ECMAScript programs and cannot be directly included in runtime
ECMAScript string values."
My reading of the ECMAScript Language Specification, edition 5.1
(January 2011), is:
1) ECMAScript allows, but does not require, implementations to support
the full Unicode character set.
2) ECMAScript allows source code of ECMAScript programs to contain
characters from the full Unicode character set.
3) ECMAScript requires implementations to treat String values as
sequences of UTF-16 code units, and defines key functionality based on
an interpretation of String values as sequences of UTF-16 code units,
not based on an interpretation as sequences of Unicode code points.
4) ECMAScript prohibits implementations from conforming to the Unicode
standard with regards to case conversions.
The relevant text portions leading to these statements are:
1) Section 2, Conformance: "A conforming implementation of this
Standard shall interpret characters in conformance with the Unicode
Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2
or UTF-16 as the adopted encoding form, implementation level 3. If the
adopted ISO/IEC 10646-1 subset is not otherwise specified, it is
presumed to be the BMP subset, collection 300. If the adopted encoding
form is not otherwise specified, it presumed to be the UTF-16 encoding
form."
To interpret this, note that the Unicode Standard, Version 3.1 was the
first one to encode actual supplementary characters [1], and that the
only difference between UCS-2 and UTF-16 is that UTF-16 supports
supplementary characters while UCS-2 does not [2].
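To illustrate that difference concretely (a sketch; U+1D11E MUSICAL
SYMBOL G CLEF stands in for any supplementary character):

    // Build U+1D11E from its UTF-16 surrogate pair D834 DD1E.
    // A UTF-16 implementation interprets the pair as one supplementary
    // character; a UCS-2 implementation sees only two isolated
    // surrogate code units and no supplementary character at all.
    var clef = String.fromCharCode(0xD834, 0xDD1E);
    clef.length; // 2 either way: String values count code units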
2) Section 6, Source Text: "ECMAScript source text is represented as a
sequence of characters in the Unicode character encoding, version 3.0
or later. [...] ECMAScript source text is assumed to be a sequence of
16-bit code units for the purposes of this specification. [...] If an
actual source text is encoded in a form other than 16-bit code units
it must be processed as if it was first converted to UTF-16."
To interpret this, note again that the Unicode Standard, Version 3.1
was the first one to encode actual supplementary characters, and that
the conversion requirement enables the use of supplementary characters
represented as 4-byte UTF-8 sequences in source text. As UTF-8 is now
the most commonly used character encoding on the web [3], the 4-byte
UTF-8 representation, not Unicode escape sequences, should be seen as
the normal representation of supplementary characters in ECMAScript
source text.
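As a sketch of what this means in practice, assume a source file
served as UTF-8, with U+10400 DESERET CAPITAL LETTER LONG I as an
arbitrary supplementary character:

    // In UTF-8 source text, the literal below contains the 4-byte
    // sequence F0 90 90 80 for U+10400. Per section 6, it is processed
    // as if first converted to UTF-16, i.e. as the surrogate pair
    // D801 DC00.
    var deseret = "𐐀";
    deseret === "\uD801\uDC00"; // true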
3) Section 6, Source Text: "If an actual source text is encoded in a
form other than 16-bit code units it must be processed as if it was
first converted to UTF-16. [...] Throughout the rest of this document,
the phrase “code unit” and the word “character” will be used to
refer to a 16-bit unsigned value used to represent a single 16-bit
unit of text." Section 15.5.4.5, String.prototype.charCodeAt(pos):
"Returns a Number (a nonnegative integer less than 2**16) representing
the code unit value of the character at position pos in the String
resulting from converting this object to a String." Section 15.5.5.1
length: "The number of characters in the String value represented by
this String object."
I don't like that the specification redefines a commonly used term
such as "character" to mean something quite different ("code unit"),
and hides that redefinition in a section on source text while applying
it primarily to runtime behavior. But there it is: Thanks to the
redefinition, it's clear that charCodeAt() returns UTF-16 code units,
and that the length property holds the number of UTF-16 code units in
the string.
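A short sketch of the resulting runtime behavior, again using U+10400:

    var s = "\uD801\uDC00"; // one supplementary character
    s.length;        // 2: counts UTF-16 code units, not code points
    s.charCodeAt(0); // 0xD801, the lead surrogate
    s.charCodeAt(1); // 0xDC00, the trail surrogate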
4) Section 15.5.4.16, String.prototype.toLowerCase(): "For the
purposes of this operation, the 16-bit code units of the Strings are
treated as code points in the Unicode Basic Multilingual Plane.
Surrogate code points are directly transferred from S to L without any
mapping."
This does not meet Conformance Requirement C8 of the Unicode Standard,
Version 6.0 [4]: "When a process interprets a code unit sequence which
purports to be in a Unicode character encoding form, it shall
interpret that code unit sequence according to the corresponding code
point sequence."
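A sketch of the conflict, once more with U+10400, whose lowercase
mapping in the Unicode character database is U+10428 DESERET SMALL
LETTER LONG I:

    var s = "\uD801\uDC00"; // U+10400
    // Per the ES5.1 algorithm quoted above, the surrogate code units
    // pass through unmapped:
    s.toLowerCase() === "\uD801\uDC00"; // true under ES5.1
    // A Unicode-conformant conversion would instead yield
    // "\uD801\uDC28", i.e. U+10428.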
References:
[1] http://www.unicode.org/reports/tr27/tr27-4.html
[2] http://www.unicode.org/glossary/#U
[3] as Mark Davis reported at the Unicode Conference 2010
[4] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
Best regards,
Norbert
On May 16, 2011, at 11:11 , Allen Wirfs-Brock wrote:
I tried to post a pointer to this strawman on this list a few weeks
ago, but apparently it didn't reach the list for some reason.
Feedback would be appreciated:
http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
Allen