I have read the discussion so far, but would like to come back to the strawman itself, because I believe it starts with a problem statement that is incorrect and is misleading the discussion. Describing the current situation correctly would help the discussion of possible changes, in particular their compatibility impact.

The relevant portion of the problem statement:

"ECMAScript currently only directly supports the 16-bit basic multilingual plane (BMP) subset of Unicode which is all that existed when ECMAScript was first designed. [...] As currently defined, characters in this expanded character set cannot be used in the source code of ECMAScript programs and cannot be directly included in runtime ECMAScript string values."


My reading of the ECMAScript Language Specification, edition 5.1 (January 2011), is:

1) ECMAScript allows, but does not require, implementations to support the full Unicode character set.

2) ECMAScript allows source code of ECMAScript programs to contain characters from the full Unicode character set.

3) ECMAScript requires implementations to treat String values as sequences of UTF-16 code units, and defines key functionality based on an interpretation of String values as sequences of UTF-16 code units, not based on an interpretation as sequences of Unicode code points.

4) ECMAScript prohibits implementations from conforming to the Unicode Standard with regard to case conversions.


The relevant text portions leading to these statements are:

1) Section 2, Conformance: "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form."

To interpret this, note that the Unicode Standard, Version 3.1 was the first one to encode actual supplementary characters [1], and that the only difference between UCS-2 and UTF-16 is that UTF-16 supports supplementary characters while UCS-2 does not [2].
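
To make the difference concrete, here is a small sketch in ECMAScript (the function name codePointToSurrogatePair is mine, not anything from the specification) that computes the UTF-16 surrogate pair for a supplementary code point; UCS-2 has no such mechanism and simply cannot represent code points above U+FFFF:

    // Compute the UTF-16 surrogate pair for a code point above U+FFFF.
    // The name codePointToSurrogatePair is purely illustrative.
    function codePointToSurrogatePair(cp) {
        var offset = cp - 0x10000;
        var lead  = 0xD800 + (offset >> 10);    // high (lead) surrogate
        var trail = 0xDC00 + (offset & 0x3FF);  // low (trail) surrogate
        return [lead, trail];
    }

    codePointToSurrogatePair(0x10400); // [0xD801, 0xDC00] for U+10400 DESERET CAPITAL LETTER LONG I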

2) Section 6, Source Text: "ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16."

To interpret this, note again that the Unicode Standard, Version 3.1 was the first one to encode actual supplementary characters, and that the conversion requirement enables the use of supplementary characters represented as 4-byte UTF-8 sequences in source text. As UTF-8 is now the most commonly used character encoding on the web [3], the 4-byte UTF-8 representation, not Unicode escape sequences, should be seen as the normal representation of supplementary characters in ECMAScript source text.
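
As an illustration (a sketch assuming an ES5 engine and a UTF-8 encoded source file; the variable names are mine): a string literal containing U+1D11E MUSICAL SYMBOL G CLEF typed directly into the source as its 4-byte UTF-8 sequence must be processed as if first converted to UTF-16, so it ends up identical to the same literal written with two Unicode escape sequences:

    // "𝄞" is U+1D11E, stored as the four bytes F0 9D 84 9E in a UTF-8 source file.
    var direct  = "𝄞";
    var escaped = "\uD834\uDD1E";  // the UTF-16 surrogate pair for U+1D11E
    direct === escaped;            // true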

3) Section 6, Source Text: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16. [...] Throughout the rest of this document, the phrase “code unit” and the word “character” will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text." Section 15.5.4.4, String.prototype.charCodeAt(pos): "Returns a Number (a nonnegative integer less than 2**16) representing the code unit value of the character at position pos in the String resulting from converting this object to a String." Section 15.5.5.1 length: "The number of characters in the String value represented by this String object."

I don't like that the specification redefines a commonly used term such as "character" to mean something quite different ("code unit"), and that it hides this redefinition in a section on source text while applying it primarily to runtime behavior. But there it is: thanks to the redefinition, it's clear that charCodeAt() returns UTF-16 code units, and that the length property holds the number of UTF-16 code units in the string.
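
A quick sketch of the consequences, assuming any engine that implements these sections as written (the variable name clef is mine):

    var clef = "\uD834\uDD1E";  // one supplementary character, U+1D11E MUSICAL SYMBOL G CLEF
    clef.length;                // 2 - counts UTF-16 code units, not code points
    clef.charCodeAt(0);         // 55348 (0xD834), the lead surrogate
    clef.charCodeAt(1);         // 56606 (0xDD1E), the trail surrogate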

4) Section 15.5.4.16, String.prototype.toLowerCase(): "For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from S to L without any mapping."

This does not meet Conformance Requirement C8 of the Unicode Standard, Version 6.0 [4]: "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall interpret that code unit sequence according to the corresponding code point sequence."
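
As an illustration (a sketch assuming an engine that follows section 15.5.4.16 literally; the variable name deseretI is mine): U+10400 DESERET CAPITAL LETTER LONG I has the lowercase mapping U+10428 in the Unicode Character Database, but because its surrogate code units are transferred without any mapping, toLowerCase() leaves it unchanged:

    var deseretI = "\uD801\uDC00";  // U+10400 DESERET CAPITAL LETTER LONG I
    deseretI.toLowerCase();         // "\uD801\uDC00" - not "\uD801\uDC28" (U+10428)

A process that interprets the string as UTF-16, as C8 requires, would have to apply the mapping to the code point U+10400.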


References:

[1] http://www.unicode.org/reports/tr27/tr27-4.html
[2] http://www.unicode.org/glossary/#U
[3] as Mark Davis reported at the Unicode Conference 2010
[4] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf


Best regards,
Norbert



On May 16, 2011, at 11:11, Allen Wirfs-Brock wrote:

I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason.

Feedback would be appreciated:

http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings

Allen
