I have read the discussion so far, but would like to come back to the strawman itself, because I believe it starts with a problem statement that is incorrect and is misleading the discussion. Describing the current situation correctly would help the discussion of possible changes, in particular their compatibility impact.

The relevant portion of the problem statement:

"ECMAScript currently only directly supports the 16-bit basic multilingual plane (BMP) subset of Unicode which is all that existed when ECMAScript was first designed. [...] As currently defined, characters in this expanded character set cannot be used in the source code of ECMAScript programs and cannot be directly included in runtime ECMAScript string values."


My reading of the ECMAScript Language Specification, edition 5.1 (January 2011), is:

1) ECMAScript allows, but does not require, implementations to support the full Unicode character set.

2) ECMAScript allows source code of ECMAScript programs to contain characters from the full Unicode character set.

3) ECMAScript requires implementations to treat String values as sequences of UTF-16 code units, and defines key functionality based on an interpretation of String values as sequences of UTF-16 code units, not based on an interpretation as sequences of Unicode code points.

4) ECMAScript prohibits implementations from conforming to the Unicode Standard with regard to case conversions.


The relevant text portions leading to these statements are:

1) Section 2, Conformance: "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form."

To interpret this, note that the Unicode Standard, Version 3.1 was the first one to encode actual supplementary characters [1], and that the only difference between UCS-2 and UTF-16 is that UTF-16 supports supplementary characters while UCS-2 does not [2].
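
To make the difference concrete, here is a small sketch in ECMAScript (the function name codePointToSurrogatePair is mine, not anything from the specification) that computes the UTF-16 surrogate pair for a supplementary code point; UCS-2 has no such mechanism and simply cannot represent code points above U+FFFF:

    // Compute the UTF-16 surrogate pair for a code point above U+FFFF.
    // The name codePointToSurrogatePair is purely illustrative.
    function codePointToSurrogatePair(cp) {
        var offset = cp - 0x10000;
        var lead  = 0xD800 + (offset >> 10);    // high (lead) surrogate
        var trail = 0xDC00 + (offset & 0x3FF);  // low (trail) surrogate
        return [lead, trail];
    }

    codePointToSurrogatePair(0x10400); // [0xD801, 0xDC00] for U+10400 DESERET CAPITAL LETTER LONG I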

2) Section 6, Source Text: "ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16."

To interpret this, note again that the Unicode Standard, Version 3.1 was the first one to encode actual supplementary characters, and that the conversion requirement enables the use of supplementary characters represented as 4-byte UTF-8 sequences in source text. As UTF-8 is now the most commonly used character encoding on the web [3], the 4-byte UTF-8 representation, not Unicode escape sequences, should be seen as the normal representation of supplementary characters in ECMAScript source text.
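
As an illustration (a sketch assuming an ES5 engine and a UTF-8 encoded source file; the variable names are mine): a string literal containing U+1D11E MUSICAL SYMBOL G CLEF typed directly into the source as its 4-byte UTF-8 sequence must be processed as if first converted to UTF-16, so it ends up identical to the same literal written with two Unicode escape sequences:

    // "𝄞" is U+1D11E, stored as the four bytes F0 9D 84 9E in a UTF-8 source file.
    var direct  = "𝄞";
    var escaped = "\uD834\uDD1E";  // the UTF-16 surrogate pair for U+1D11E
    direct === escaped;            // true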

3) Section 6, Source Text: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16. [...] Throughout the rest of this document, the phrase “code unit” and the word “character” will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text." Section 15.5.4.4, String.prototype.charCodeAt(pos): "Returns a Number (a nonnegative integer less than 2**16) representing the code unit value of the character at position pos in the String resulting from converting this object to a String." Section 15.5.5.1 length: "The number of characters in the String value represented by this String object."

I don't like that the specification redefines a commonly used term such as "character" to mean something quite different ("code unit"), and that it hides this redefinition in a section on source text while applying it primarily to runtime behavior. But there it is: thanks to the redefinition, it's clear that charCodeAt() returns UTF-16 code units, and that the length property holds the number of UTF-16 code units in the string.
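
A quick sketch of the consequences, assuming any engine that implements these sections as written (the variable name clef is mine):

    var clef = "\uD834\uDD1E";  // one supplementary character, U+1D11E MUSICAL SYMBOL G CLEF
    clef.length;                // 2 - counts UTF-16 code units, not code points
    clef.charCodeAt(0);         // 55348 (0xD834), the lead surrogate
    clef.charCodeAt(1);         // 56606 (0xDD1E), the trail surrogate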

4) Section 15.5.4.16, String.prototype.toLowerCase(): "For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from S to L without any mapping."

This does not meet Conformance Requirement C8 of the Unicode Standard, Version 6.0 [4]: "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall interpret that code unit sequence according to the corresponding code point sequence."
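
As an illustration (a sketch assuming an engine that follows section 15.5.4.16 literally; the variable name deseretI is mine): U+10400 DESERET CAPITAL LETTER LONG I has the lowercase mapping U+10428 in the Unicode Character Database, but because its surrogate code units are transferred without any mapping, toLowerCase() leaves it unchanged:

    var deseretI = "\uD801\uDC00";  // U+10400 DESERET CAPITAL LETTER LONG I
    deseretI.toLowerCase();         // "\uD801\uDC00" - not "\uD801\uDC28" (U+10428)

A process that interprets the string as UTF-16, as C8 requires, would have to apply the mapping to the code point U+10400.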


References:

[1] http://www.unicode.org/reports/tr27/tr27-4.html
[2] http://www.unicode.org/glossary/#U
[3] as Mark Davis reported at the Unicode Conference 2010
[4] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf


Best regards,
Norbert



On May 16, 2011, at 11:11, Allen Wirfs-Brock wrote:

I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason.

Feedback would be appreciated:

http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings

Allen
