I haven't reviewed the new spec draft in detail yet, but have some comments on 
the comments from Rich and Allen - see below.

Norbert


On Jul 10, 2012, at 20:53, Allen Wirfs-Brock wrote:

> 
> On Jul 10, 2012, at 7:50 PM, Gillam, Richard wrote:
> 
>> Allen--
>> 
>> A few comments on the i18n/Unicode-related stuff in the latest draft:
>> 
>> - p. 1, §2: It seems a little weird here to be specifying a particular 
>> version of the Unicode standard but not of ISO 10646.  Down in section 3, 
>> you _do_ nail down the version of 10646 and it's long, so I can see why you 
>> don't want all this verbiage in section 2 as well, but maybe you want more 
>> than you have?
> 
> I'm not sure.  This was Norbert's recommendation.  I liked what we did in ES5, 
> where we specified version 3.0 as the minimum version for which there was 
> guaranteed interoperability but allowed use of more recent versions.  Input 
> on what would make the most sense is appreciated.

Rich's comment was on the lack of any version number for ISO 10646, not on the 
Unicode version number. We can simplify the statement in clause 2 to "A 
conforming implementation of this Standard shall interpret characters in 
conformance with the Unicode Standard and ISO/IEC 10646, both in the versions 
referenced in clause 3."

>> - p. 14 §6: More substantively, do you really need to go into this level of 
>> detail as to what a "Unicode character" is?  I would think you could say 
>> something like "ECMAScript source text is a sequence of Unicode abstract 
>> code point values (or, in this spec, "Unicode characters").  The actual 
>> representation of those characters in bits (e.g., UTF-16 or UTF-32 or even a 
>> non-Unicode encoding) is implementation-dependent, but a conforming 
>> implementation must process source text as if it were an equivalent sequence 
>> of SourceCharacter values."  I think that for the purposes of this spec, how 
>> "Unicode code point" maps to a normal human's idea of "character" is 
>> irrelevant; you can define "character" to mean the same thing as Unicode 
>> means when it says "code point" and be done with it.  (This probably means 
>> you can either get rid of the next paragraph, or at least that that paragraph 
>> is entirely informative.)
> 
> First, I have to say that there will probably be some controversy about this 
> section in TC39.  Norbert's proposal was that we specify SourceCharacter as 
> always being UTF-16 encoded, while I went in the direction of essentially 
> defining it as abstract characters identified by code points.  No doubt 
> there will be additional discussion about this.
> 
> There is enough confusion concerning ECMAScript source code (code point vs 
> code unit, UTF-16 or not, etc.) in previous editions that I wanted to be as 
> clear as possible.  The key point is just that the ECMAScript specification 
> is assigning meanings to certain Unicode characters/character sequences, and 
> that meaning is independent of any file encoding processing that may take 
> place within an implementation.

I think basing the specification on UTF-16 code units as source code would be 
easier, but using Unicode code points as the basis isn't wrong either.

We should stay away from the terms "character", "Unicode character", and 
"Unicode scalar value", however.

For "character", people have different ideas of what the term means, and 
redefining it, as ES5 did, would just add to the confusion.

"Unicode character" is not defined in the Unicode standard, as far as I can 
tell, but seems to be used in the sense of "code point assigned to abstract 
character" or possibly "designated code point". With either definition, it 
would exclude code points reserved for future assignment, such as characters 
that were added in Unicode 6.1 if your implementation was based on Unicode 5.1. 
Such a restriction would be a constant source of interoperability problems.

"Unicode scalar value" is defined in the Unicode standard as "Any Unicode code 
point except high-surrogate and low-surrogate code points." We cannot exclude 
surrogate code points from source code, as this would break compatibility with 
existing code.
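
To make the compatibility point concrete, here is a small sketch of current 
behavior (nothing new, just ES5 as implemented today):

    // Existing code can already create and pass around lone surrogates:
    var lead = "\uD800";            // a high surrogate by itself
    var s = lead + "\uDC00";        // concatenation forms a surrogate pair
    s.length;                       // 2 -- two UTF-16 code units
    s.charCodeAt(0).toString(16);   // "d800"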

"Unicode code point" and "UTF-16 code unit" are the terms we have to use most 
of the time.

I agree with Rich that we should limit the discussion to what's relevant to the 
spec.

>> - p. 19, §7.6: I tend to agree with your comment here-- since this was 
>> nailed to Unicode 3.0 before, it seems better to stick with that when we're 
>> talking about "portability" (although a note explaining why it's not Unicode 
>> 5.1 might be helpful).

I disagree. Unicode 5.1 support is part of ES6 just like the "let" and "class" 
keywords. I assume we're not going to tell programmers to stay away from "let" 
and "class". Why should we tell them to stay away from Unicode 5.1?

This paragraph is really about the fact that some implementations will support 
Unicode 6.1 or later by the time ES6 becomes a standard, while others will be 
stuck at Unicode 5.1. Using characters that were introduced in Unicode 6.1 in 
identifiers would mean that the application only runs on implementations based 
on Unicode 6.1 or higher, not on those based on Unicode 6.0 or lower.
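
As a hypothetical illustration (I'm assuming U+11183, SHARADA LETTER A, which 
was added in Unicode 6.1, is classified as an identifier-start character, and 
that the draft's \u{ } escapes are allowed in identifiers):

    // Hypothetical: an identifier whose only character was added in Unicode 6.1.
    // An implementation whose identifier tables are based on Unicode 6.0 or
    // earlier would reject this with a SyntaxError; one based on Unicode 6.1
    // or later would accept it.
    var \u{11183} = 1;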

>> - p. 24, §7.8.4: In earlier versions of ECMAScript, I could often specify a 
>> supplementary-plane character by using two Unicode escape sequences in a 
>> row, each representing a surrogate code unit value.  Can I still do that?  
>> It seems like you'd have to support this for backward compatibility, but 
>> you're not really supposed to see bare surrogates in any context except for 
>> UTF-16 (I don't think they're strictly illegal, except in UTF-8, but the 
>> code point sequence <D800 DC00> isn't equivalent to U+10000, either.)  I 
>> think you want some verbiage here clarifying how this is supposed to work. A 
>> \uNNNN escape is a BMP code point so it will always contribute exactly one 
>> element to the string value. 
> 
> I believe that the algorithmic text of the spec. is clear in this regard.  
> But informative text could be added.
> 
> What actually happens depends upon contextual details that your 4th item 
> refers to.
> 
> In a string literal, each non-escape sequence character contributes one code 
> point to the literal.  String values are made up of 16-bit elements, with 
> non-BMP code points being UTF-16 encoded as two elements.  A \u{ } escape 
> also represents one code point that, depending upon its value, will 
> contribute one or two elements to the string value.  A \uNNNN escape 
> represents a BMP code point, so it is always represented as one string 
> element.  If you have existing code like "\ud800\udc00", you will get 
> the same two-element string value that you would get if you wrote "\u{10000}" 
> or "\u{d800}\u{dc00}".  They are all alternative ways of expressing the same 
> two-element string value.  The first form must be supported for backwards 
> compatibility.
> 
> Outside of string literals (and friends, such as quasis; I'm actually glossing 
> over a couple of other exceptions) we don't have to worry about UTF-16 
> encoding because we are dealing with more abstract concepts, such as 
> "identifiers", that we can deal with at the level of Unicode characters.  We 
> also don't have backwards compat. issues because in those contexts current 
> implementations treat surrogate pairs as two distinct "characters", either 
> of which results in a syntax error in all non-literal contexts.  E.g., current 
> implementations reject identifiers containing supplementary characters that 
> are (according to Unicode) legal identifier characters.

Careful here. I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 
≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around 
that convert any non-ASCII characters into (old-style) Unicode escapes.
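
To make the equivalences Allen describes concrete, a small sketch, assuming the 
\u{ } escapes behave as in the current draft:

    // All three literals denote the same two-element string value:
    var a = "\uD800\uDC00";      // two old-style escapes, one per surrogate
    var b = "\u{10000}";         // one new-style escape for the supplementary
                                 // code point, UTF-16 encoded on the way in
    var c = "\u{D800}\u{DC00}";  // new-style escapes for the individual surrogates
    a === b;   // true
    b === c;   // true
    a.length;  // 2 -- string length counts UTF-16 code units, not code points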

>> - p. 210, §15.5.3.2: I like the idea of introducing fromCodeUnit(), making 
>> this function an alias of that one, and marking this function as obsolete.  
>> But I'm also wondering if it would make more sense for this function to be 
>> called fromCodeUnits(), since you can specify a whole list of code units, 
>> and they all contribute to the string.
> 
> The singular form follows the convention established by fromCharCode.  It 
> could change if nobody objects.

fromCodeUnit seems rather redundant. Note that any code unit sequence it 
accepts would be equally accepted, with the same result, by fromCodePoints, as 
that function accepts surrogate code points and then, in the conversion to 
UTF-16, erases the distinction between surrogate code points and surrogate code 
units.
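
A sketch to make the redundancy concrete; the fromCodePoints name and signature 
are still under discussion, so treat these calls as hypothetical:

    // Hypothetical fromCodePoints calls, assuming each argument is converted
    // to its UTF-16 form:
    String.fromCodePoints(0xD800, 0xDC00);  // "\uD800\uDC00" -- surrogate code
                                            // points become the same code units
    String.fromCharCode(0xD800, 0xDC00);    // "\uD800\uDC00" -- identical result
    String.fromCodePoints(0x10000);         // "\uD800\uDC00" -- supplementary
                                            // code point, UTF-16 encoded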

>> - p. 210, §15.5.3.3: Same thing: Maybe call this fromCodePoints()?  [Note 
>> also you have a copy-and-paste problem on the first line, where it still 
>> says "String.fromCharCode()".]  

fromCodePoints would be fine. There are a few more copy-and-paste references to 
codeUnits, and a "codePoint" missing an "s".

>> - p. 212, §15.5.4.4: I like the idea of adding a new name for this function, 
>> but I'm thinking maybe codeUnitAt().  Or do what Java did (IIRC): Add a new 
>> function called char32At(), which behaves like this one, except that if the 
>> index you give it points to one half of a surrogate pair, you return the 
>> code point value represented by the surrogate pair.  (If you don't do some 
>> sort of char32At() function, you're probably going to need a function that 
>> takes a sequence of UTF-16 code unit values and returns a sequence of 
>> Unicode code point values.)
> 
> I also have thought about unicodeCharAt() or perhaps uCharAt().  

codeUnitAt is clearly a better name than charAt (especially in a language that 
doesn't have a char data type), but since we can't get rid of charAt, I'm not 
sure it's worth adding codeUnitAt.

I'm not aware of any char32At function in Java. Do you mean codePointAt? That's 
in both Java and the ES6 draft.
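
For reference, a sketch of how codePointAt behaves in the ES6 draft (mirroring 
java.lang.String.codePointAt):

    var s = "\uD800\uDC00";  // one supplementary character, two code units
    s.codePointAt(0);        // 65536 (0x10000) -- the whole pair is decoded
    s.codePointAt(1);        // 56320 (0xDC00) -- the index points at the trail
                             // surrogate, so only that code unit is returned
    s.charCodeAt(0);         // 55296 (0xD800) -- charCodeAt stays code-unit based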

>> - p. 220, §§15.5.4.17 and 15.5.4.19: Maybe this is a question for Norbert: 
>> Are we allowing somewhere for versions of toLocaleUpperCase() and 
>> toLocaleLowerCase() that let you specify the locale as a parameter instead 
>> of just using the host environment's default locale?
> 
> this is covered by the I18N API spec. Right?

It's not in the Internationalization API edition 1, but seems a prime candidate 
for edition 2.
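
A sketch of what such an overload could look like; the locale parameter is 
hypothetical until edition 2 actually specifies it:

    // Hypothetical: toLocaleUpperCase taking an explicit locale instead of
    // relying on the host environment's default locale.
    "i".toLocaleUpperCase();      // result depends on the default locale
    "i".toLocaleUpperCase("tr");  // "\u0130" (I WITH DOT ABOVE) under Turkish
                                  // case mapping, if a locale argument is added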

>> - p. 223, §15.5.4.5: First, did something go haywire with the numbering 
>> here?  Second, this sort of addresses my comment above, but if you can't put 
>> this and charCodeAt() (or whatever we called it) together in the spec, can 
>> you include a pointer in charCodeAt()'s description to here?  Third, it 
>> looks like this only works right with surrogate pairs if you specify the 
>> position of the first surrogate in the pair.  I think you want it to work 
>> right if you specify the position of either element in the pair.  (I think 
>> you may have a typo in step 11 as well: shouldn't that be "…or second > 
>> 0xDFFF"?)
> 
> It's supposed to be 15.5.4.25.  Also, yes about the step 11 typo.  Norbert 
> proposed this function, so we should get his thoughts on the addressing issue. 
> As I wrote this I did think a bit about whether or not we need to provide 
> some support for backward iteration over strings.

At some point we have to give chapter 15 a logical structure again, rather than 
just offering sediment layers.

Requiring the correct position is intentional; it's the same in 
java.lang.String.codePointAt. If we want to support backwards iteration, we 
could add codePointBefore.
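
If we do add it, here is a minimal sketch of the behavior I have in mind 
(hypothetical name and semantics, modeled on java.lang.String.codePointBefore):

    // Hypothetical codePointBefore(s, pos): decode the code point that ends
    // immediately before position pos, stepping back over a surrogate pair
    // if there is one.
    function codePointBefore(s, pos) {
      var trail = s.charCodeAt(pos - 1);
      if (trail >= 0xDC00 && trail <= 0xDFFF && pos >= 2) {
        var lead = s.charCodeAt(pos - 2);
        if (lead >= 0xD800 && lead <= 0xDBFF) {
          return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
        }
      }
      return trail;
    }
    codePointBefore("\uD800\uDC00", 2);  // 65536 (0x10000)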

There are more issues with this function, which I'll comment on separately.
