
A few comments on the i18n/Unicode-related stuff in the latest draft:

- p. 1, §2: It seems a little weird here to be specifying a particular version 
of the Unicode standard but not of ISO 10646.  Down in section 3, you _do_ nail 
down the version of 10646 and it's long, so I can see why you don't want all 
this verbiage in section 2 as well, but maybe you want more than you have?

- p. 14, §6: A few typos: "The phrase 'Unicode character' referS to…"  The S is
missing.  "Each source character being an abstract Unicode characterS…" The S
is unnecessary.

- p. 14 §6: More substantively, do you really need to go into this level of 
detail as to what a "Unicode character" is?  I would think you could say 
something like "ECMAScript source text is a sequence of Unicode abstract code 
point values (or, in this spec, "Unicode characters").  The actual 
representation of those characters in bits (e.g., UTF-16 or UTF-32 or even a 
non-Unicode encoding) is implementation-dependent, but a conforming 
implementation must process source text as if it were an equivalent sequence of 
SourceCharacter values."  I think that for the purposes of this spec, how 
"Unicode code point" maps to a normal human's idea of "character" is 
irrelevant; you can define "character" to mean the same thing as Unicode means 
when it says "code point" and be done with it.  (This probably means you can 
either get rid of the next paragraph, or at least that that paragraph is
entirely informative.)

- p. 14, §6, ¶3: "Within other contexts, such an escape sequence contextually 
contributes one Unicode character."  This read kind of funny to me-- I didn't 
know what "contextually" meant.  Before, you had something like "Within a 
string literal, an escape sequence contributes one Unicode character to the
value of the literal", and I suspect "contextually" was intended to mean that 
the escape sequence contributes one character in whatever way is appropriate 
for the context.  I wonder if it would be better to say something like "In all 
other contexts, an escape sequence is treated identically to the Unicode 
character with the specified Unicode scalar value" or something like that.  
(That probably needs some wordsmithing, though, since \u000a would be equivalent
to \n most of the time, not to a literal line-feed character, as you point out 
in the next paragraph.)

- p. 14, §6, ¶3: Should there be a pointer here to the actual definition of a 
Unicode escape sequence?

- p. 14, §6: I suppose it doesn't hurt to explain how to map an abstract 
Unicode value to UTF-16 here, but couldn't you just point to the definition of 
UTF-16 in the Unicode standard?
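
For reference, the mapping in question is small. The following is a minimal sketch of the standard UTF-16 encoding of a single code point; the helper name toUtf16CodeUnits is hypothetical and is not taken from the draft.

```javascript
// Hypothetical helper (not from the draft): encode one Unicode code point
// as an array of UTF-16 code units.
function toUtf16CodeUnits(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  if (codePoint <= 0xFFFF) {
    // BMP values map to a single code unit, unchanged.
    return [codePoint];
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const lead = 0xD800 + (offset >> 10);    // top 10 bits -> lead surrogate
  const trail = 0xDC00 + (offset & 0x3FF); // low 10 bits -> trail surrogate
  return [lead, trail];
}

console.log(toUtf16CodeUnits(0x10000)); // [ 55296, 56320 ], i.e. D800 DC00
```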

- p. 19, §7.6: I tend to agree with your comment here-- since this was nailed 
to Unicode 3.0 before, it seems better to stick with that when we're talking 
about "portability" (although a note explaining why it's not Unicode 5.1 might 
be helpful).

- p. 19, §7.6: "Return the String value consisting of the sequence of code 
units…" Do you want to say "UTF-16 code units" here and in other places where 
this occurs, and maybe define that up front?  Each of the Unicode encoding 
forms (UTF-8 and UTF-32 as well as UTF-16) has its own definition of "code 

- p. 24, §7.8.4: In earlier versions of ECMAScript, I could often specify a 
supplementary-plane character by using two Unicode escape sequences in a row, 
each representing a surrogate code unit value.  Can I still do that?  It seems 
like you'd have to support this for backward compatibility, but you're not 
really supposed to see bare surrogates in any context except for UTF-16 (I
don't think they're strictly illegal, except in UTF-8, but the code point
sequence <D800 DC00> isn't equivalent to U+10000, either).  I think you want
some verbiage here clarifying how this is supposed to work.

- p. 25, "Early Errors": You say it's a Syntax Error if you specify a \u{} 
escape sequence with a value greater than 10FFFF.  Should \u{} escape sequences 
with values corresponding to the surrogate range also produce syntax errors?  
If not, is \u{d800}\u{dc00} equivalent to \ud800\udc00 (which I presume is 
equivalent to \u{10000})?
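
To make the question concrete, here is a small check of the three spellings. It reflects how engines implementing the later published ES2015 edition behave, which is an assumption relative to the Rev9 draft under review; the variable and helper names are illustrative only.

```javascript
// Assumption: this runs on an engine implementing published ES2015 semantics,
// which may differ from the Rev9 draft discussed here.
const viaSurrogateEscapes = "\ud800\udc00";     // two classic \uXXXX escapes
const viaBracedSurrogates = "\u{d800}\u{dc00}"; // two braced escapes in the surrogate range
const viaCodePointEscape = "\u{10000}";         // one braced code point escape

console.log(viaSurrogateEscapes === viaCodePointEscape); // true
console.log(viaBracedSurrogates === viaCodePointEscape); // true
console.log(viaCodePointEscape.length);                  // 2 (UTF-16 code units)

// Hypothetical helper: report whether evaluating a source snippet is a SyntaxError.
function isSyntaxError(source) {
  try {
    eval(source);
    return false;
  } catch (e) {
    return e instanceof SyntaxError;
  }
}

// A value above 10FFFF is rejected when the literal is parsed.
const outOfRangeRejected = isSyntaxError('"\\u{110000}"');
console.log(outOfRangeRejected); // true
```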

- p. 210, § I like the idea of introducing fromCodeUnit(), making this 
function an alias of that one, and marking this function as obsolete.  But I'm 
also wondering if it would make more sense for this function to be called 
fromCodeUnits(), since you can specify a whole list of code units, and they all 
contribute to the string.
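
A short illustration of why the plural name may fit: the existing String.fromCharCode already accepts a list of code units, and all of them contribute to the result. String.fromCodePoint is used here as it exists in published ES2015, which is an assumption relative to this draft.

```javascript
// Two code units (a surrogate pair) passed to the existing function.
const fromUnits = String.fromCharCode(0xD83D, 0xDE00);
// The same text built from a single code point.
const fromPoint = String.fromCodePoint(0x1F600);

console.log(fromUnits === fromPoint); // true
console.log(fromUnits.length);        // 2

// fromCharCode works on 16-bit code units, so a larger value is truncated.
const truncated = String.fromCharCode(0x1F600);
console.log(truncated === "\uF600");  // true
```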

- p. 210, § Same thing: Maybe call this fromCodePoints()?  [Note also 
you have a copy-and-paste problem on the first line, where it still says 

- p. 212, § I like the idea of adding a new name for this function, 
but I'm thinking maybe codeUnitAt().  Or do what Java did (IIRC): Add a new 
function called char32At(), which behaves like this one, except that if the 
index you give it points to one half of a surrogate pair, you return the code 
point value represented by the surrogate pair.  (If you don't do some sort of 
char32At() function, you're probably going to need a function that takes a 
sequence of UTF-16 code unit values and returns a sequence of Unicode code 
point values.)

- p. 212, § Same comments as above.  I think you're either going to 
need to add a charCode32At() or a function that converts sequences of code 
units to sequences of code points.  You might also need to add some kind of 
function to facilitate iterating through a string by code point.
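
As a rough sketch of the kind of conversion function meant here, the following walks a string's UTF-16 code units and returns the corresponding code points. The name codePointsOf is hypothetical, and passing unpaired surrogates through unchanged is one possible policy, not something the draft specifies.

```javascript
// Hypothetical helper: convert a string's UTF-16 code units into code points.
// Unpaired surrogates are passed through as-is (an assumed policy).
function codePointsOf(str) {
  const result = [];
  for (let i = 0; i < str.length; i++) {
    const first = str.charCodeAt(i);
    if (first >= 0xD800 && first <= 0xDBFF && i + 1 < str.length) {
      const second = str.charCodeAt(i + 1);
      if (second >= 0xDC00 && second <= 0xDFFF) {
        // Well-formed pair: combine both halves and skip the trail unit.
        result.push((first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000);
        i++;
        continue;
      }
    }
    result.push(first);
  }
  return result;
}

console.log(codePointsOf("a\ud83d\ude00b")); // [ 97, 128512, 98 ]
```

The same loop also serves as a way to iterate through a string by code point rather than by code unit.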

- pp. 212-213, § Should you say "code units" instead of "string 

- p. 220, §§ and Maybe this is a question for Norbert: Are 
we allowing somewhere for versions of toLocaleUpperCase() and 
toLocaleLowerCase() that let you specify the locale as a parameter instead of 
just using the host environment's default locale?

- p. 223, § First, did something go haywire with the numbering here?  
Second, this sort of addresses my comment above, but if you can't put this and 
charCodeAt() (or whatever we called it) together in the spec, can you include a 
pointer in charCodeAt()'s description to here?  Third, it looks like this only 
works right with surrogate pairs if you specify the position of the first 
surrogate in the pair.  I think you want it to work right if you specify the 
position of either element in the pair.  (I think you may have a typo in step 
11 as well: shouldn't that be "…or second > 0xDFFF"?)
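
A minimal sketch of the "either element" behavior suggested above, written as a standalone function. The name codePointAtEither is hypothetical, and returning undefined for an out-of-range position is an assumption rather than something taken from the draft.

```javascript
// Hypothetical helper: return the code point at pos, whether pos refers to
// the lead or the trail surrogate of a pair.
function codePointAtEither(str, pos) {
  if (pos < 0 || pos >= str.length) return undefined; // assumed out-of-range result
  const isLead = (u) => u >= 0xD800 && u <= 0xDBFF;
  const isTrail = (u) => u >= 0xDC00 && u <= 0xDFFF;
  const combine = (lead, trail) =>
    (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;

  const unit = str.charCodeAt(pos);
  if (isLead(unit) && pos + 1 < str.length) {
    const next = str.charCodeAt(pos + 1);
    if (isTrail(next)) return combine(unit, next);
  }
  if (isTrail(unit) && pos > 0) {
    const prev = str.charCodeAt(pos - 1);
    if (isLead(prev)) return combine(prev, unit);
  }
  return unit; // BMP character or unpaired surrogate
}

const sample = "a\ud83d\ude00";
console.log(codePointAtEither(sample, 1)); // 128512 (0x1F600)
console.log(codePointAtEither(sample, 2)); // 128512 (0x1F600)
```

The trade-off is that a caller stepping through positions one code unit at a time sees the same supplementary character twice.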

- pp. 223-224, § How come you changed "characters" to "elements" in 
some spots and "code units" in others?  Is there a difference?  (I'm seeing 
this in some of the number-formatting stuff too.)

Thanks a lot…

--Rich Gillam

On Jul 8, 2012, at 6:22 PM, Allen Wirfs-Brock wrote:

Rev9 (July 8, 2012) of the ES6 Draft Specification is now available at

Changes in this version include:

- Quasi literal added to specification
- Initial work at defining tail call semantics (still need to define tail
positions in 13.7)
- Initial pass at replacing native/host object terminology with ordinary/exotic
- Clause 6 and others updated to clarify processing of full Unicode source code.
Revised usage of “code unit” and “code point”
- Specification of Identifiers updated to use current Unicode specification
- \u{nnnnnn} Unicode code point escapes added
- UTF-16 encoding for non-BMP characters in string literals now fully specified
- Added functions: String.fromCodePoint, String.raw (a quasi tag function),
- ECMAScript now requires use of Unicode 5.1.0, normative references updated
- A syntactic grammar notation was added for indicating when alternative lexical
goals are required
- Fixed ES5 missing explicitly setting length in several array functions
- Fixed bugs: 368, 388-399, 402-405, 410-413, 415-416, 418, 420-428, 430-439,
445-456, 458-461 (thanks very much for all the bug reports)

Please report bugs you find at<>.