I added these functions to String because that seems the best place for them in 
the current arrangement. I'm aware of the proposal to modularize the standard 
library [1] and can well imagine that these functions will find a better home 
in that new scheme.

The other character classification scheme I'm looking into is based on Unicode 
character properties. The reasons why I separated out this proposal are:

- Tools operating on ECMAScript source code need to be aware of the ECMAScript 
version they use, for syntax, semantics, keywords, and, well, the characters 
allowed in identifiers. Some tools let their clients specify an ECMAScript 
version (e.g., "es5" in JSLint and JSHint), others may assume a fixed version. 
The characters in turn are tied to both Unicode versions and ECMAScript 
versions - for example, SpiderMonkey currently supports Unicode 6.2 characters, 
but restricted to the BMP because it hasn't been upgraded to ES6 identifiers 
yet.

- For Unicode character properties, on the other hand, clients generally need 
only the properties as of the latest known version, and in the few exceptions 
that I know of (such as the 2003 version of IDNA) only specific Unicode 
versions are needed. Requiring that a general API for Unicode character
properties provide access to Unicode version-specific information would create
a huge burden on implementors, but benefit no one.

- It's difficult for tools developers to determine the correct set of 
characters to include as identifier characters. One particular difficulty is 
that the Unicode general category of a character can change in rare cases, so a 
character can move into or out of the categories that the ES3/ES5 
specifications reference. For compatibility, characters shouldn't move out of 
the set of characters allowed for identifiers. (It turns out that browsers also
get this wrong - all of them.) ES6 solves this problem by basing its
identifier definition on Unicode Standard Annex 31, Unicode Identifier and
Pattern Syntax [2], which defines special sets of characters, Other_ID_Start
and Other_ID_Continue, and treats these characters as identifier characters
even though their current general categories no longer qualify them as such.

- For general Unicode processing, I think it's important to have support in 
regular expressions, because that's what many applications use for text 
processing. For tools operating on ECMAScript source code that seems less 
important, based on the data I collected [3].
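
To illustrate the kind of regular expression support meant in the last point, here is a minimal sketch of general text processing by Unicode property. It assumes an engine that implements Unicode property escapes (`\p{...}` with the `u` flag), which were standardized in ECMAScript well after this message; the function name `extractWords` is made up for the example.

```javascript
// Sketch only: split text into "words" by Unicode general category.
// Assumes regular expression support for \p{...} with the u flag.
// A word here is a run of letters (L) and combining marks (M); digits,
// punctuation, and whitespace act as separators.
const WORD = /[\p{L}\p{M}]+/gu;

// Hypothetical helper: returns all words found in the text.
function extractWords(text) {
  return text.match(WORD) || [];
}
```

Note that such an expression always uses whatever Unicode version the engine ships with, which is fine for general text processing but not for tools that must follow a specific ECMAScript edition.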

So, rather than having one grand unified character classification API with
support for both Unicode versions and regular expressions, I think it's better
to provide tailored APIs for different purposes.
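
As a rough illustration of why the identifier case depends on the ECMAScript version, the following sketch checks a string against the ES5 rules (general categories, BMP only) or the ES6 rules (ID_Start and ID_Continue, which include the Other_ID_Start and Other_ID_Continue characters). It is not the API from the strawman: the name `isIdentifierSketch` and its `edition` parameter are hypothetical, it relies on `\p{...}` support in regular expressions, it ignores reserved words and Unicode escape sequences, and it uses the engine's own Unicode version rather than one tied to the edition.

```javascript
// Sketch only; names and signature are assumptions for this example.
// ES5: IdentifierStart is Lu, Ll, Lt, Lm, Lo, Nl, "$", or "_";
// IdentifierPart adds Mn, Mc, Nd, Pc, ZWNJ (U+200C), and ZWJ (U+200D).
const ES5_START = /^[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}$_]$/u;
const ES5_PART =
  /^[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}$_\u200C\u200D]$/u;
// ES6: based on the UAX 31 properties ID_Start and ID_Continue.
const ES6_START = /^[\p{ID_Start}$_]$/u;
const ES6_PART = /^[\p{ID_Continue}$\u200C\u200D]$/u;

function isIdentifierSketch(text, edition) {
  if (edition !== 5 && edition !== 6) {
    throw new RangeError("edition must be 5 or 6");
  }
  const chars = Array.from(text); // iterate by code point, not code unit
  if (chars.length === 0) return false;
  return chars.every((ch, i) => {
    if (edition === 5) {
      // ES5 identifiers are limited to the BMP.
      if (ch.codePointAt(0) > 0xFFFF) return false;
      return (i === 0 ? ES5_START : ES5_PART).test(ch);
    }
    return (i === 0 ? ES6_START : ES6_PART).test(ch);
  });
}
```

With this sketch, U+2118 (general category Sm, but listed in Other_ID_Start) and the supplementary character U+10400 are accepted for edition 6 and rejected for edition 5, while "$x_1" is accepted for both.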

[1] http://wiki.ecmascript.org/doku.php?id=harmony:modules_standard
[2] http://www.unicode.org/reports/tr31/#Backward_Compatibility
[3] http://wiki.ecmascript.org/doku.php?id=strawman:identifier_identification

Norbert


On Mar 9, 2013, at 9:16, Allen Wirfs-Brock wrote:

> Norbert,
> 
> Can you explain why you think these should be functions on String rather
> than part of a more general character classification facility that might be
> associated with some more specialized object?  The latter approach would seem
> to have modularity advantages at both the implementation and usage
> level.
> 
> Allen
> 
> 
> 
> 
> On Mar 7, 2013, at 11:35 PM, Norbert Lindenberg wrote:
> 
>> ECMAScript is used to implement a variety of tools that check code for 
>> conformance with the ECMAScript specification, minimize it, perform other 
>> transformations, or generate ECMAScript code. These tools have to be able to 
>> recognize ECMAScript identifiers, taking the identifier specification and 
>> the underlying Unicode specification into consideration - not quite easy 
>> given the ever-growing Unicode character set.
>> 
>> While looking at support for Unicode character properties in general, I 
>> realized that this use case is shaped differently from others, fundamental 
>> to ECMAScript, and amenable to a fairly simple solution, and so there's now 
>> a strawman:
>> http://wiki.ecmascript.org/doku.php?id=strawman:identifier_identification
>> 
>> I'd like to discuss this at next week's TC 39 meeting, but also invite 
>> earlier comments.
>> 
>> Thanks,
>> Norbert
>> 
>> _______________________________________________
>> es-discuss mailing list
>> es-discuss@mozilla.org
>> https://mail.mozilla.org/listinfo/es-discuss
>> 
> 

