Re: New full Unicode for ES6 idea

2012-03-02 Thread Glenn Adams
On Fri, Mar 2, 2012 at 12:58 AM, Erik Corry erik.co...@gmail.com wrote:

 2012/3/1 Glenn Adams gl...@skynav.com:
  I'd like to plead for a solution rather like the one Java has, where
  strings are sequences of UTF-16 codes and there are specialized ways
  to iterate over them.  Looking at this entry from the Unicode FAQ:
  http://unicode.org/faq/char_combmark.html#7 there are different ways
  to describe the length (and iteration) of a string.  The BRS proposal
  favours #2, but I think for most applications utf-16-based-#1 is just
  fine, and for the applications that want to do it right #3 is almost
  always the correct solution.  Solution #3 needs library support in any
  case and has no problems with UTF-16.
 
  The central point here is that there are combining characters
  (accents) that you can't just normalize away.  Getting them right has
  a lot of the same issues as surrogate pairs (you shouldn't normally
  chop them up, they count as one 'character', you can't tell how many
  of them there are in a string without looking, etc.).  If you can
  handle combining characters then the surrogate pair support falls out
  pretty much for free.
 
 
  The problem here is that you are mixing apples and oranges. Although it
  *may* appear that surrogate pairs and grapheme clusters have features in
  common, they operate at different semantic levels entirely. A solution
 that
  attempts to conflate these two levels is going to cause problems at both
  levels. A distinction should be maintained between the following levels:
 
  (1) encoding units (e.g., UTF-16 coding units)
  (2) unicode scalar values (code points)
  (3) grapheme clusters

 This distinction is not lost on me.  I propose that random access
 indexing and .length in JS should work on level 1,


that's where we are today: indexing and length based on 16-bit code units
(of a UTF-16 encoding, likewise with Java)
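Concretely, the status quo can be sketched in plain JavaScript (U+1D306 TETRAGRAM FOR CENTRE is just an example supplementary character):

```javascript
// "\uD834\uDF06" is U+1D306 (TETRAGRAM FOR CENTRE) encoded as a surrogate pair.
var tetragram = "\uD834\uDF06";

// Indexing and .length count 16-bit code units, not Unicode characters:
console.log(tetragram.length);                      // 2, not 1
console.log(tetragram.charCodeAt(0).toString(16));  // "d834" (high surrogate)
console.log(tetragram.charCodeAt(1).toString(16));  // "df06" (low surrogate)

// charAt(0) yields an unpaired high surrogate, not a character:
console.log(tetragram.charAt(0) === "\uD834");      // true
```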


 and there should be
 library support for levels 2 and 3.  In order of descending usefulness
 I think the order is 1, 3, 2.  Therefore I don't want to cause a lot
 of backwards compatibility headaches by prioritizing the efficient
 handling of level 2.


from a perspective of indexing Unicode characters, level 2 is the correct
place;

level 3 is useful for higher level, language/locale sensitive text
processing, but not particularly interesting at the basic ES string
processing level; we aren't talking about (or IMO should not be talking
about) a level 3 text processing library in this thread;
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: New full Unicode for ES6 idea

2012-03-02 Thread Erik Corry
2012/3/2 Glenn Adams gl...@skynav.com:

 On Fri, Mar 2, 2012 at 12:58 AM, Erik Corry erik.co...@gmail.com wrote:

 2012/3/1 Glenn Adams gl...@skynav.com:
  I'd like to plead for a solution rather like the one Java has, where
  strings are sequences of UTF-16 codes and there are specialized ways
  to iterate over them.  Looking at this entry from the Unicode FAQ:
  http://unicode.org/faq/char_combmark.html#7 there are different ways
  to describe the length (and iteration) of a string.  The BRS proposal
  favours #2, but I think for most applications utf-16-based-#1 is just
  fine, and for the applications that want to do it right #3 is almost
  always the correct solution.  Solution #3 needs library support in any
  case and has no problems with UTF-16.
 
  The central point here is that there are combining characters
  (accents) that you can't just normalize away.  Getting them right has
  a lot of the same issues as surrogate pairs (you shouldn't normally
  chop them up, they count as one 'character', you can't tell how many
  of them there are in a string without looking, etc.).  If you can
  handle combining characters then the surrogate pair support falls out
  pretty much for free.
 
 
  The problem here is that you are mixing apples and oranges. Although it
  *may* appear that surrogate pairs and grapheme clusters have features in
  common, they operate at different semantic levels entirely. A solution
  that
  attempts to conflate these two levels is going to cause problems at both
  levels. A distinction should be maintained between the following levels:
 
  (1) encoding units (e.g., UTF-16 coding units)
  (2) unicode scalar values (code points)
  (3) grapheme clusters

 This distinction is not lost on me.  I propose that random access
 indexing and .length in JS should work on level 1,


 that's where we are today: indexing and length based on 16-bit code units
 (of a UTF-16 encoding, likewise with Java)

Not really for JS.  Missing parts in the current UTF-16 support have
been listed in this thread, e.g. in Norbert Lindenberg's six-point
prioritization list, which I replied to yesterday.

 and there should be
 library support for levels 2 and 3.  In order of descending usefulness
 I think the order is 1, 3, 2.  Therefore I don't want to cause a lot
 of backwards compatibility headaches by prioritizing the efficient
 handling of level 2.


 from a perspective of indexing Unicode characters, level 2 is the correct
 place;

Yes, by definition.

 level 3 is useful for higher level, language/locale sensitive text

No, the Unicode grapheme clustering algorithm is not locale or
language sensitive
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

 processing, but not particularly interesting at the basic ES string
 processing level; we aren't talking about (or IMO should not be talking
 about) a level 3 text processing library in this thread;

I will continue to feel free to talk about it as I believe that in the
cases where just indexing by UTF-16 words is not sufficient it is
normally level 3 that is the correct level.  Also, I think there
should be support for this level in JS as it is not locale-dependent.
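The locale-independent default boundaries Erik refers to can be illustrated for today's readers with Intl.Segmenter, an API standardized long after this 2012 thread (the example character choice is mine):

```javascript
// Two code points ("e" + U+0301 COMBINING ACUTE ACCENT) form one default
// grapheme cluster under the UAX #29 rules, independent of locale.
const decomposed = "e\u0301"; // "é" in decomposed form

console.log(decomposed.length); // 2 (UTF-16 code units)

// Intl.Segmenter (added to ECMAScript years after this discussion)
// implements the default grapheme cluster boundaries:
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const clusters = Array.from(segmenter.segment(decomposed), s => s.segment);

console.log(clusters.length); // 1 -- a single grapheme cluster
```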

-- 
Erik Corry


Re: New full Unicode for ES6 idea

2012-03-02 Thread Allen Wirfs-Brock


On Mar 1, 2012, at 11:09 PM, Norbert Lindenberg wrote:

 Comments:
 
 1) In terms of the prioritization I suggested a few days ago
 https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
 it seems you're considering item 6 essential, item 1 a side effect (whose 
 consequences are not mentioned - see below), items 2-5 nice to have. Do I 
 understand that correctly? What is this prioritization based on?

The main intent of this proposal was to push forward with including \u{ } in 
ES6, regardless of any other ongoing full-Unicode discussions we are having. 
Hopefully we can achieve more than that, but if we don't, the inclusion of 
\u{ } now should make it easier the next time we attack this problem, by 
reducing the use of \uXXXX\uXXXX pairs, which are ambiguous in intent. My 
expectation is that we would tell the world that \u{ } is the new 
\uXXXX\uXXXX and that they should avoid using the latter form to inject 
supplementary characters into strings (and RegExps).

However, that usage depends upon the fact that today's implementations do 
generally allow supplementary characters to exist in ECMAScript source code 
and that they do something rational with them. ES5 botched this by saying 
that supplementary characters can't exist in ECMAScript source code, so we 
also need to fix that.


 
 
 2) The description of the current situation seems incorrect. The strawman 
 says: "As currently specified by ES5.1, supplementary characters cannot be 
 used in the source code of ECMAScript programs." I don't see anything in the 
 spec saying this. To the contrary, the following statement in clause 6 of the 
 spec opens the door to supplementary characters: "If an actual source text is 
 encoded in a form other than 16-bit code units it must be processed as if it 
 was first converted to UTF-16." Actual source text outside of an ECMAScript 
 runtime is rarely stored in streams of 16-bit code units; it's normally 
 stored and transmitted in UTF-8 (including its subset ASCII) or some other 
 single-byte or multi-byte character encoding. Interpreting source text 
 therefore almost always requires conversion to UTF-16 as a first step. UTF-8 
 and several other encodings (GB18030, Big5-HKSCS, EUC-TW) can represent 
 supplementary characters, and correct conversion to UTF-16 will convert them 
 to surrogate pairs.
 
 When I mentioned this before, you said that the intent of the ES5 wording was 
 to keep ECMAScript limited to the BMP (the UCS-2 world).
 https://mail.mozilla.org/pipermail/es-discuss/2011-May/014337.html
 https://mail.mozilla.org/pipermail/es-discuss/2011-May/014342.html
 However, I don't see that intent reflected in the actual text of clause 6.
 
 I have since also tested with supplementary characters in UTF-8 source text 
 on a variety of current browsers (Safari / (Mac, iOS), (Firefox, Chrome, 
 Opera) / (Mac, Windows), Explorer / Windows), and they all handle the 
 conversion from UTF-8 to UTF-16 correctly. Do you know of one that doesn't? 
 The only ECMAScript implementation I encountered that fails here is Node.js.

http://code.google.com/p/v8/issues/detail?id=761 suggests that V8 truncates 
supplementary characters rather than converting them to surrogate pairs.  
However, it is unclear whether that is referring to literal strings in the 
source code or only computationally generated strings. 
 
 In addition to plain text encoding in UTF-8, supplementary characters can 
 also be represented in source code as a sequence of two Unicode escapes. It's 
 not as convenient, but it works in all implementations I've tested, including 
 Node.js.

the main problem is:

  SourceCharacter ::
    any Unicode code unit

and "...the phrase 'code unit' and the word 'character' will be used to refer 
to a 16-bit unsigned value...".

All of the lexical rules in clause 7 are defined in terms of characters (i.e., 
code units). So, for example, a supplementary character in category Lo 
occurring in an Identifier context would, at best, be seen as a pair of code 
units, neither of which is in a category that is valid for IdentifierPart, so 
the identifier would be invalid. Similarly, a pair of \u escapes representing 
such a character would also be lexed as two distinct characters and result in 
an invalid identifier.

Regarding the intent of the current wording, I was speaking of my intent when 
I was actually editing that text for the ES5 spec. My understanding at the 
time was that the lexical alphabet of ECMAScript was 16-bit code units, and I 
was trying to clarify that, but I think I botched it. In reality, I think that 
understanding is actually still correct: as I noted in the previous paragraph, 
there is nothing in the lexical grammar that deals with anything other than 
16-bit code units. Any conversion from a non-16-bit character encoding is 
something that logically happens prior to processing as ECMAScript source 
code.

 
 
 3) Changing the source code to be just a stream of 

Re: New full Unicode for ES6 idea

2012-03-02 Thread Glenn Adams
On Fri, Mar 2, 2012 at 2:13 AM, Erik Corry erik.co...@gmail.com wrote:

  level 3 is useful for higher level, language/locale sensitive text

 No, the Unicode grapheme clustering algorithm is not locale or
 language sensitive
 http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries


one final comment: the Unicode algorithm is intended to define default
behavior only:

This specification defines *default* mechanisms; more sophisticated
implementations can *and should* tailor them for particular locales or
environments.

it specifically states that implementations *should* provide
language/locale sensitive behavior;

in actual text processing usage, one needs the language/locale sensitive
behavior in most cases, not a default behavior

G.


Re: New full Unicode for ES6 idea

2012-03-02 Thread Brendan Eich

Glenn Adams wrote:


On Fri, Mar 2, 2012 at 2:13 AM, Erik Corry erik.co...@gmail.com 
mailto:erik.co...@gmail.com wrote:


 level 3 is useful for higher level, language/locale sensitive text

No, the Unicode grapheme clustering algorithm is not locale or
language sensitive
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries


one final comment: the Unicode algorithm is intended to define default 
behavior only:


This specification defines /default/ mechanisms; more sophisticated 
implementations can /and should/ tailor them for particular locales or 
environments.


it specifically states that implementations *should* provide 
language/locale sensitive behavior;


in actual text processing usage, one needs the language/locale 
sensitive behavior in most cases, not a default behavior
Right, and Gecko, WebKit, and other web rendering engines obviously need 
to care about this. But they're invariably implemented mostly or wholly 
*not* in JS.


It's a bit ambitious for JS to have such facilities.

I agree with Erik that the day may come, but ES6 is being prototyped and 
spec'ed and we need to be done in 2012 to have the spec ready for 2013. 
We should not overreach.


/be



Re: New full Unicode for ES6 idea

2012-03-01 Thread Erik Corry
I'm not in favour of big red switches, and I don't think the
compartment based solution is going to be workable.

I'd like to plead for a solution rather like the one Java has, where
strings are sequences of UTF-16 codes and there are specialized ways
to iterate over them.  Looking at this entry from the Unicode FAQ:
http://unicode.org/faq/char_combmark.html#7 there are different ways
to describe the length (and iteration) of a string.  The BRS proposal
favours #2, but I think for most applications utf-16-based-#1 is just
fine, and for the applications that want to do it right #3 is almost
always the correct solution.  Solution #3 needs library support in any
case and has no problems with UTF-16.

The central point here is that there are combining characters
(accents) that you can't just normalize away.  Getting them right has
a lot of the same issues as surrogate pairs (you shouldn't normally
chop them up, they count as one 'character', you can't tell how many
of them there are in a string without looking, etc.).  If you can
handle combining characters then the surrogate pair support falls out
pretty much for free.

Advantages of my proposal:

* High level of backwards compatibility
* No issues of where to place the BRS
* Compact and simple in the implementation
* Can be polyfilled on most VMs
* Interaction with the DOM is unproblematic
* No issues of what happens on concatenation if a surrogate pair is created.

Details:

* The built-in string charCodeAt, [], and length operations work in terms of UTF-16
* String.fromCharCode(x) can return a string with a length of 2
* New object: StringIterator

new StringIterator(backing) returns a string iterator.  The iterator
has the following methods:

hasNext();           // Returns this.index() != this.backing().length
nextGrapheme();      // Returns the next grapheme as a Unicode code point,
                     // or -1 if the next grapheme is a sequence of code points
nextGraphemeArray(); // Returns an array of numeric code points
                     // (possibly just one) representing the next grapheme
nextCodePoint();     // Returns the next code point, possibly consuming
                     // both halves of a surrogate pair
index();             // Gets the current index in the string, from 0 to length
setIndex();          // Sets the current index in the string, from 0 to length
backing();           // Gets the backing string

// Optionally
hasPrevious();
previous*();         // Analogous to nextGrapheme etc.
codePointLength();   // Takes O(length); cache the answer if you care
graphemeLength();    // Ditto

If any of the next* functions encounters an unmatched half of a
surrogate pair, it just returns its number.
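A minimal sketch of part of this proposed API in plain JavaScript (my own illustrative implementation, covering only hasNext/nextCodePoint and the index accessors; the grapheme methods would additionally need UAX #29 boundary data):

```javascript
// Sketch of the proposed StringIterator (hypothetical API, not a shipped
// standard): code-point iteration over a UTF-16 backing string.
function StringIterator(backing) {
  this._backing = backing;
  this._index = 0;
}
StringIterator.prototype.backing = function () { return this._backing; };
StringIterator.prototype.index = function () { return this._index; };
StringIterator.prototype.setIndex = function (i) { this._index = i; };
StringIterator.prototype.hasNext = function () {
  return this._index !== this._backing.length;
};
// Returns the next code point, consuming both halves of a valid surrogate
// pair; an unmatched surrogate half is simply returned as its own number.
StringIterator.prototype.nextCodePoint = function () {
  var s = this._backing;
  var first = s.charCodeAt(this._index++);
  if (first >= 0xD800 && first <= 0xDBFF && this._index < s.length) {
    var second = s.charCodeAt(this._index);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      this._index++;
      return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
    }
  }
  return first;
};

var it = new StringIterator("a\uD834\uDF06b");
console.log(it.nextCodePoint().toString(16)); // "61"    ("a")
console.log(it.nextCodePoint().toString(16)); // "1d306" (one code point)
console.log(it.nextCodePoint().toString(16)); // "62"    ("b")
console.log(it.hasNext());                    // false
```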

Regexp support.  Regexps act 'as if' the following steps were performed.

Outside character classes, an extended character turns into (?:xy),
where x and y are the two halves of its surrogate pair.
Inside positive character classes, the extended characters are
extracted, so [abz] becomes (?:[ab]|xy), where z is an extended
character and x and y are the halves of its surrogate pair.
Negative character classes can be handled by transforming them into
negative lookaheads.
A decent set of Unicode character classes will likely subsume most
uses of these transformations.
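The outside-character-class step can be sketched as a small helper (the function name is mine, purely for illustration of the transformation, not a proposed API):

```javascript
// Rewrite a code point as a regexp source fragment over 16-bit code units.
// A supplementary code point becomes the (?:xy) surrogate-pair sequence
// described above, so a code-unit-based engine matches it atomically.
function codePointToUtf16Pattern(cp) {
  if (cp <= 0xFFFF) {
    return "\\u" + ("0000" + cp.toString(16)).slice(-4);
  }
  var offset = cp - 0x10000;
  var high = 0xD800 + (offset >> 10);    // high surrogate
  var low = 0xDC00 + (offset & 0x3FF);   // low surrogate
  return "(?:\\u" + high.toString(16) + "\\u" + low.toString(16) + ")";
}

// U+1D306 becomes the two-code-unit sequence \ud834\udf06:
var pattern = codePointToUtf16Pattern(0x1D306);
console.log(pattern);                                  // "(?:\ud834\udf06)"
console.log(new RegExp(pattern).test("\uD834\uDF06")); // true
```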

Perhaps the BRS 21-bit solution feels marginally cleaner, but having
two different kinds of strings in the same VM feels like a horrible
solution that is user-visible and will haunt implementations forever.
The cleanliness difference is very marginal, given that grapheme-based
iteration is the correct solution for almost all the cases where
iterating over UTF-16 code units is not good enough.

-- 
Erik Corry

2012/2/20 Phillips, Addison addi...@lab126.com:
 Mark wrote:



 First, it would be great to get full Unicode support in JS. I know that's
 been a problem for us at Google.



 AP +1: I think we’ve waited for supplementary character support long
 enough!



 Secondly, while I agree with Addison that the approach that Java took is
 workable, it does cause problems.



 AP The tension is between “compatibility” and “ease of use” here, I think.
 The question is whether very many scripts depend on the ‘uint16’ nature of a
 character in ES, use surrogates to effect supplementary character support,
 or are otherwise tied to the existing encoding model and are broken as a
 result of changes. In its ideal form, an ES string would logically be a
 sequence of Unicode characters (code points) and only the internal
 representation would worry about whatever character encoding scheme made the
 most sense (in many cases, this might actually be UTF-16).



 AP … but what I think is hard to deal with are different modes of
 processing scripts depending on “fullness of the Unicode inside”.
 Admittedly, the approach I favor is rather conservative and presents a
 number of challenges, most notably in adapting regex or for users who want
 to work strictly in terms of character values.



 There are good reasons for why Java did what it did, basically for
 compatibility. But if there is some way that JS can work around those,
 that'd be great.



 AP Yes, it would.



 ~Addison




Re: New full Unicode for ES6 idea

2012-03-01 Thread Erik Corry
2012/2/22 Norbert Lindenberg ecmascr...@norbertlindenberg.com:
 I'll reply to Brendan's proposal in two parts: first about the goals for 
 supplementary character support, second about the BRS.

 Full 21-bit Unicode support means all of:

 * indexing by characters, not uint16 storage units;
 * counting length as one greater than the last index; and
 * supporting escapes with (up to) six hexadecimal digits.

 For me, full 21-bit Unicode support has a different priority list.

 First come the essentials: Regular expressions; functions that interpret 
 strings; the overall sense that all Unicode characters are supported.

 1) Regular expressions must recognize supplementary characters as atomic 
 entities, and interpret them according to Unicode semantics.

 Look at the contortions one has to go through currently to describe a simple 
 character class that includes supplementary characters:
 https://github.com/roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js

 Read up on why it has to be done this way, and see to what extremes some 
 people are going to make supplementary characters work despite ECMAScript:
 http://inimino.org/~inimino/blog/javascript_cset

 Now, try to figure out how you'd convert a user-entered string to a regular 
 expression such that you can search for the string without case distinction, 
 where the string may contain supplementary characters such as жвь (Deseret 
 for one).

 Regular expressions matter a lot here because, if done properly, they 
 eliminate much of the need for iterating over strings manually.

 2) Built-in functions that interpret strings have to recognize supplementary 
 characters as atomic entities and interpret them according to their Unicode 
 semantics. The list of functions in ES5 that violate this principle is 
 actually rather short: Besides the String functions relying on regular 
 expressions (match, replace, search, split), they're the String case 
 conversion functions (toLowerCase, toLocaleLowerCase, toUpperCase, 
 toLocaleUpperCase) and the relational comparison for strings (11.8.5). But 
 the principle is also important for new functionality being considered for 
 ES6 and above.

 3) It must be clear that the full Unicode character set is allowed and 
 supported. This means at least getting rid of the reference to UCS-2 (clause 
 2) and the bizarre equivalence between characters and UTF-16 code units 
 (clause 6). ECMAScript has already defined several ways to create UTF-16 
 strings containing supplementary characters (parsing UTF-8 source; using 
 Unicode escapes for surrogate pairs), and lets applications freely pass 
 around such strings. Browsers have surrounded ECMAScript implementations with 
 text input, text rendering, DOM APIs, and XMLHTTPRequest with full Unicode 
 support, and generally use full UTF-16 to exchange text with their ECMAScript 
 subsystem. Developers have used this to build applications that support 
 supplementary characters, hacking around the remaining gaps in ECMAScript as 
 seen above. But, as in the bug report that Brendan pointed to this morning 
 (http://code.google.com/p/v8/issues/detail?id=761), the mention of UCS-2 is 
 still used by some to excuse bugs.

I agree that these are the priorities and should be done, including
reopening and fixing the V8 bug.

 Only after these essentials come the niceties of String representation and 
 Unicode escapes:

 4) 1 String element to 1 Unicode code point is indeed a very nice and 
 desirable relationship. Unlike Java, where binary compatibility between 
 virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript 
 needs to be compatible only at the source code level - or maybe, with a BRS, 
 not even that.

I don't think this is important enough to justify the
incompatibility/implementation pain.

Agree with your points 5 and 6.  One extra point of my own:

* I think we should prefer transparency in cases where there is doubt.
This means passing data through with no errors or changes. It means
allowing half surrogate pairs, combining characters that have nothing
to combine with, and characters that are not currently assigned in
Unicode. In an ideal world, it's hard to see why these happen, but in
the cases where they happen the most helpful thing to do is almost
always to allow/ignore them.

Here are two hypothetical examples:

We get data from a source that chops up a UTF-16 text into chunks and
sends them separately for transmission.  This will result in unmatched
pairs of surrogates, but as long as our application transmits the
data unchanged, no harm results after they are recombined later.
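The first scenario can be made concrete (a quick sketch; the chunk boundary is my choice, placed deliberately between the two surrogate halves):

```javascript
// Slicing between the halves of a surrogate pair yields two ill-formed
// pieces, yet concatenation restores the original text as long as nothing
// "repairs" or rejects the unpaired halves along the way.
var original = "x\uD834\uDF06y";     // 4 code units, 3 code points
var chunkA = original.slice(0, 2);   // ends with an unpaired high surrogate
var chunkB = original.slice(2);      // starts with an unpaired low surrogate

console.log(chunkA + chunkB === original); // true -- no harm done
```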

Take an XML format where all the tags are ASCII, but there is body
text that contains floating point numbers encoded as 16 bit values,
including malformed surrogate pairs.  This is pretty sick, but who are
we to judge?  We want to treat this as a string because we can use
string operations on the XML tags, but it would be extremely unhelpful
to 

Re: New full Unicode for ES6 idea

2012-03-01 Thread Norbert Lindenberg
Comments:

1) In terms of the prioritization I suggested a few days ago
https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
it seems you're considering item 6 essential, item 1 a side effect (whose 
consequences are not mentioned - see below), items 2-5 nice to have. Do I 
understand that correctly? What is this prioritization based on?


2) The description of the current situation seems incorrect. The strawman says: 
"As currently specified by ES5.1, supplementary characters cannot be used in 
the source code of ECMAScript programs." I don't see anything in the spec 
saying this. To the contrary, the following statement in clause 6 of the spec 
opens the door to supplementary characters: "If an actual source text is 
encoded in a form other than 16-bit code units it must be processed as if it 
was first converted to UTF-16." Actual source text outside of an ECMAScript 
runtime is rarely stored in streams of 16-bit code units; it's normally stored 
and transmitted in UTF-8 (including its subset ASCII) or some other single-byte 
or multi-byte character encoding. Interpreting source text therefore almost 
always requires conversion to UTF-16 as a first step. UTF-8 and several other 
encodings (GB18030, Big5-HKSCS, EUC-TW) can represent supplementary characters, 
and correct conversion to UTF-16 will convert them to surrogate pairs.

When I mentioned this before, you said that the intent of the ES5 wording was 
to keep ECMAScript limited to the BMP (the UCS-2 world).
https://mail.mozilla.org/pipermail/es-discuss/2011-May/014337.html
https://mail.mozilla.org/pipermail/es-discuss/2011-May/014342.html
However, I don't see that intent reflected in the actual text of clause 6.

I have since also tested with supplementary characters in UTF-8 source text on 
a variety of current browsers (Safari / (Mac, iOS), (Firefox, Chrome, Opera) / 
(Mac, Windows), Explorer / Windows), and they all handle the conversion from 
UTF-8 to UTF-16 correctly. Do you know of one that doesn't? The only ECMAScript 
implementation I encountered that fails here is Node.js.

In addition to plain text encoding in UTF-8, supplementary characters can also 
be represented in source code as a sequence of two Unicode escapes. It's not as 
convenient, but it works in all implementations I've tested, including Node.js.
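The two-escape workaround can be shown in a couple of lines (a hedged sketch; U+1D306 is my example character):

```javascript
// A supplementary character written as a sequence of two 16-bit Unicode
// escapes -- the surrogate pair D834/DF06 encoding U+1D306:
var viaEscapes = "\uD834\uDF06";
var viaFromCharCode = String.fromCharCode(0xD834, 0xDF06);

console.log(viaEscapes === viaFromCharCode); // true
console.log(viaEscapes.length);              // 2 code units for 1 code point
```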


3) Changing the source code to be just a stream of Unicode characters seems a 
good idea overall. However, just changing the definition of SourceCharacter is 
going to break things. SourceCharacter isn't only used for source syntax and 
JSON syntax, where the change seems benign; it's also used to define the 
content of String values and the interpretation of regular expression patterns:
- Subclause 7.8.4 contains the statements "The SV of DoubleStringCharacters :: 
DoubleStringCharacter is a sequence of one character, the CV of 
DoubleStringCharacter." and "The CV of DoubleStringCharacter :: SourceCharacter 
but not one of " or \ or LineTerminator is the SourceCharacter character 
itself." If SourceCharacter becomes a Unicode character, then this means 
coercing a 21-bit code point into a single 16-bit code unit, and that's not 
going to end well.
- Subclauses 15.10.1 and 15.10.2 use SourceCharacter to define 
PatternCharacter, IdentityEscape, RegularExpressionNonTerminator, 
ClassAtomNoDash. While this could potentially be part of a set of changes to 
make regular expression correctly support full Unicode, by itself it means that 
21-bit code points will be coerced into or compared against 16-bit code units. 
Changing regular expressions to be code-point based has some compatibility risk 
which we need to carefully evaluate.


4) The statement about UnicodeEscapeSequence: "This production is limited to 
only expressing 16-bit code point values." is incorrect. Unicode escape 
sequences express 16-bit code units, not code points (remember that any use of 
the word "character" without the prefix "Unicode" in the spec after clause 6 
means 16-bit code unit). A supplementary character can be represented in 
source code as a sequence of two Unicode escapes. The proposed new Unicode 
escape syntax is more convenient and more legible, but doesn't provide new 
functionality.


5) I don't understand the sentence "For that reason, it is impossible to know 
for sure whether pairs of existing 16-bit Unicode escapes are intended to 
represent a single logical character or an explicit two character UTF-16 
encoding of a Unicode characters." - what do you mean by "an explicit two 
character UTF-16 encoding of a Unicode characters"? In any case, it seems 
pretty clear to me that a Unicode escape for a high surrogate value followed by 
a Unicode escape for a low surrogate value, with the spec based on 16-bit 
values, means a surrogate pair representing a supplementary character. Even if 
the system were then changed to be 32-bit based, it's hard to imagine that the 
intent was to create a sequence of two invalid code points.

Norbert


On Feb 

Re: New full Unicode for ES6 idea

2012-03-01 Thread Erik Corry
2012/3/1 Glenn Adams gl...@skynav.com:

 2012/3/1 Erik Corry erik.co...@gmail.com

 I'm not in favour of big red switches, and I don't think the
 compartment based solution is going to be workable.

 I'd like to plead for a solution rather like the one Java has, where
 strings are sequences of UTF-16 codes and there are specialized ways
 to iterate over them.  Looking at this entry from the Unicode FAQ:
 http://unicode.org/faq/char_combmark.html#7 there are different ways
 to describe the length (and iteration) of a string.  The BRS proposal
 favours #2, but I think for most applications utf-16-based-#1 is just
 fine, and for the applications that want to do it right #3 is almost
 always the correct solution.  Solution #3 needs library support in any
 case and has no problems with UTF-16.

 The central point here is that there are combining characters
 (accents) that you can't just normalize away.  Getting them right has
 a lot of the same issues as surrogate pairs (you shouldn't normally
 chop them up, they count as one 'character', you can't tell how many
 of them there are in a string without looking, etc.).  If you can
 handle combining characters then the surrogate pair support falls out
 pretty much for free.


 The problem here is that you are mixing apples and oranges. Although it
 *may* appear that surrogate pairs and grapheme clusters have features in
 common, they operate at different semantic levels entirely. A solution that
 attempts to conflate these two levels is going to cause problems at both
 levels. A distinction should be maintained between the following levels:

 encoding units (e.g., UTF-16 coding units)
 unicode scalar values (code points)
 grapheme clusters

This distinction is not lost on me.  I propose that random access
indexing and .length in JS should work on level 1, and there should be
library support for levels 2 and 3.  In order of descending usefulness
I think the order is 1, 3, 2.  Therefore I don't want to cause a lot
of backwards compatibility headaches by prioritizing the efficient
handling of level 2.


 IMO, the current discussion should limit itself to the interface between the
 first and second of these levels, and not introduce the third level into the
 mix.

 G.


Re: New full Unicode for ES6 idea

2012-02-29 Thread Allen Wirfs-Brock
I posted a new strawman that describes what I think is the most minimal 
support that we must provide for full Unicode in ES.next: 
http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code 

I'm not suggesting that we must stop at this level of support, but I think not 
doing at least what is described in this proposal would be a mistake.

Thoughts?


Allen



On Feb 28, 2012, at 3:49 AM, Brendan Eich wrote:

 Wes Garland wrote:
 If four-byte escapes are statically rejected in BRS-on, we have a problem -- 
 we should be able to use old code that runs in either mode unchanged when 
 said code only uses characters in the BMP.
 
 We've been over this and I conceded to Allen that four-byte escapes (I'll 
 use \uXXXX to be clear from now on) must work as today with BRS-on. Otherwise 
 we make it hard to impossible to migrate code that knows what it is doing 
 with 16-bit code units that round-trip properly.
 
 Accepting both 4 and 6 byte escapes is a problem, though -- what is 
 "\u123456".length?  1 or 3?
 
 This is not a problem. We want .length to distribute across concatenation, so 
 3 is the only answer and in particular ("\u1234" + "\u5678").length === 2 
 irrespective of BRS.
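That distribution law is observable in any current engine (the \u{...} spelling below is the ES2015 code-point escape):

```javascript
// Each half of a surrogate pair is one code unit, and .length
// distributes over concatenation.
const hi = "\uD83D"; // lead (high) surrogate
const lo = "\uDE38"; // trail (low) surrogate
console.log(hi.length);                 // 1
console.log(lo.length);                 // 1
console.log((hi + lo).length);          // 2
console.log((hi + lo) === "\u{1F638}"); // true: the pair spells U+1F638
```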
 
 If we accept "\u1234" in BRS-on as a string with length 5 -- as we do today 
 in ES5 with "\u123".length === 4 -- we give developers a way to feature-test 
 and conditionally execute code, allowing libraries to run with BRS-on and 
 BRS-off.
 
 Feature-testing should be done using a more explicit test. API TBD, but I 
 don't think breaking \u with BRS on is a good idea.
 
 I agree with you that Roozbeh is hardly used, so it can take the hit of 
 having to feature-test the BRS. The much more common case today is JS code 
 that blithely ignores non-BMP characters that make it into strings as pairs, 
 treating them blindly as two characters (ugh; must purge that c-word 
 abusage from the spec).
 
 /be
 



Re: New full Unicode for ES6 idea

2012-02-28 Thread Brendan Eich

Wes Garland wrote:
If four-byte escapes are statically rejected in BRS-on, we have a 
problem -- we should be able to use old code that runs in either mode 
unchanged when said code only uses characters in the BMP.


We've been over this and I conceded to Allen that four-byte escapes 
(I'll use \u to be clear from now on) must work as today with 
BRS-on. Otherwise we make it hard to impossible to migrate code that 
knows what it is doing with 16-bit code units that round-trip properly.


Accepting both 4- and 6-digit escapes is a problem, though -- what is 
"\u123456".length?  1 or 3?


This is not a problem. We want .length to distribute across 
concatenation, so 3 is the only answer and in particular ("\u1234" + 
"\u5678").length === 2 irrespective of BRS.


If we accept "\u1234" in BRS-on as a string with length 5 -- as we do 
today in ES5 with "\u123".length === 4 -- we give developers a way to 
feature-test and conditionally execute code, allowing libraries to run 
with BRS-on and BRS-off.


Feature-testing should be done using a more explicit test. API TBD, but 
I don't think breaking \u with BRS on is a good idea.
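No explicit feature-test API was ever specified here; purely for illustration, a length-based probe might have looked like the following (the function name is invented, and in every shipping engine the probe reports false):

```javascript
// Hypothetical BRS probe, sketched for illustration only.
// Under BRS-on semantics an astral character would occupy a single
// string element; real engines keep the 16-bit code-unit view.
// `fullUnicodeStrings` is an invented name, not a real API.
function fullUnicodeStrings() {
  return "\uD834\uDD1E".length === 1; // U+1D11E as a surrogate pair
}
console.log(fullUnicodeStrings()); // false in current engines
```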


I agree with you that Roozbeh is hardly used, so it can take the hit of 
having to feature-test the BRS. The much more common case today is JS 
code that blithely ignores non-BMP characters that make it into strings 
as pairs, treating them blindly as two characters (ugh; must purge 
that c-word abusage from the spec).


/be


Re: New full Unicode for ES6 idea

2012-02-24 Thread Brendan Eich

Norbert Lindenberg wrote:

OK - migrations are hard. But so far most participants have only seen 
additional work, no benefits. How long will this take? When will it end? When 
will browsers make BRS-on the default, let alone eliminate the switch? When can 
Roozbeh abandon his original version? Where's the blue button?


It may be that the BRS is worse than an incompatible change to full 
Unicode as Allen proposed last year. But in either case, something gets 
harder for Roozbeh. Which is worse?


/be


Re: New full Unicode for ES6 idea

2012-02-22 Thread Wes Garland
Erratum:

var a = [0];

should read

var a = [];


Re: New full Unicode for ES6 idea

2012-02-21 Thread Andrew Oakley
On 02/20/12 16:47, Brendan Eich wrote:
 Andrew Oakley wrote:
 Issues only arise in code that tries to treat a string as an array of
 16-bit integers, and I don't think we should be particularly bothered by
 performance of code which misuses strings in this fashion (but clearly
 this should still work without opt-in to new string handling).
 
 This is all strings in JS and the DOM, today.
 
 That is, we do not have any measure of code that treats strings as
 uint16s, forges strings using \u, etc. but the ES and DOM specs
 have allowed this for over 14 years. Based on bitter experience, it's
 likely that if we change by fiat to 21-bit code points from 16-bit code
 units, some code on the Web will break.

Sorry, I don't think I was particularly clear.  The point I was trying
to make is that we can *pretend* that code points are 16-bit but
actually use a 21-bit representation internally.  If content requests
proper Unicode support we simply switch to allowing 21-bit code-points
and stop encoding characters outside the BMP using surrogate pairs
(because the characters now fit in a single code point).

 And as noted in the o.p. and in the thread based on Allen's proposal
 last year, browser implementations definitely count on representation
 via array of 16-bit integers, with length property or method counting same.
 
 Breaking the Web is off the table. Breaking implementations, less so.
 I'm not sure why you bring up UTF-8. It's good for encoding and decoding
 but for JS, unlike C, we want string to be a high level full Unicode
 abstraction. Not bytes with bits optionally set indicating more bytes
 follow to spell code points.

Yes, I probably shouldn't have brought up UTF-8 (we do store strings
using UTF-8, I was thinking about our own implementation). The intention
was not to break the web, my comments about issues when strings were
misused were purely *performance* concerns, behaviour would otherwise
remain unchanged (unless full Unicode support had been enabled).

-- 
Andrew Oakley


Re: New full Unicode for ES6 idea

2012-02-21 Thread Wes Garland
On 21 February 2012 00:03, Brendan Eich bren...@mozilla.com wrote:

 These are byte-based encodings, no? What is the problem inflating them by
 zero extension to 16 bits now (or 21 bits in the future)? You can't make an
 invalid Unicode character from a byte value.


One of my examples, GB 18030, is a four-byte encoding and a Chinese
government standard.  It is a mapping onto Unicode, but this mapping is
table-driven rather than algorithm driven like the UTF-* transport
formats.  To provide a single example, Unicode 0x2259 maps onto GB 18030
0x8136D830.

You're right about Big5 being byte-oriented, maybe this was a bad example,
although it is a double-byte charset. It works by putting ASCII down low
making bytes above 0x7f escapes into code pages dereferenced by the next
byte.  Each code point is encoded with one or two bytes, never more.  If I
were developing with Big5 in JS, I would store the byte stream 4a 4b d8 00
c1 c2 4c as  004a 004b d800 c1c2 004c.  This would allow me to use JS
regular expressions and so on.

Anyway, Big5 punned into JS strings (via a C or C++ API?) is *not* a strong
 use-case for ignoring invalid characters.


Agreed - I'm stretching to see if I can stretch far enough to find a real
problem with BRS -- because I really want it.

But the data does not need to arrive from C API -- it could easily be
delivered by an XHR request where, say, the remote end dumps database rows
into a transport format based around evaluating JS string literals (like
JSON).

Ball one. :-P


If I hit the batter, does he get to first base?

We still haven't talked about equality and normalization, I suppose that
can wait.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102


Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich

Andrew Oakley wrote:

On 02/20/12 16:47, Brendan Eich wrote:

  Andrew Oakley wrote:

  Issues only arise in code that tries to treat a string as an array of
  16-bit integers, and I don't think we should be particularly bothered by
  performance of code which misuses strings in this fashion (but clearly
  this should still work without opt-in to new string handling).
  
  This is all strings in JS and the DOM, today.
  
  That is, we do not have any measure of code that treats strings as

  uint16s, forges strings using \u, etc. but the ES and DOM specs
  have allowed this for over 14 years. Based on bitter experience, it's
  likely that if we change by fiat to 21-bit code points from 16-bit code
  units, some code on the Web will break.


Sorry, I don't think I was particularly clear.  The point I was trying
to make is that we can *pretend* that code points are 16-bit but
actually use a 21-bit representation internally.


So far, that's like Allen's proposal from last year 
(http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings). 
But you didn't say how iteration (indexing and .length) work.



If content requests
proper Unicode support we simply switch to allowing 21-bit code-points
and stop encoding characters outside the BMP using surrogate pairs
(because the characters now fit in a single code point).


How does content request proper Unicode support? Whatever that gesture 
is, it's big and red ;-). But we don't have such a switch or button to 
press like that, yet.


If a .js or .html file as fetched from a server has a UTF-8 encoding, 
indeed non-BMP characters in string literals will be transcoded in 
open-source browsers and JS engines that use uint16 vectors internally, 
but each part of the surrogate pair will take up one element in the 
uint16 vector. Let's take this now as a content request to use full 
Unicode. But the .js file was developed 8 years ago and assumes two code 
units, not one. It hardcodes for that assumption, somehow (indexing, 
.length exact value, indexOf('\ud800'), etc.). It is now broken.


And non-literal non-BMP characters won't be helped by transcoding 
differently when the .js or .html file is fetched. They'll just change 
size at runtime.


/be



Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich

Brendan Eich wrote:

in open-source browsers and JS engines that use uint16 vectors internally


Sorry, that reads badly. All I meant is that I can't tell what 
closed-source engines do, not that they do not comply with ECMA-262 
combined with other web standards to have the same observable effect, 
e.g. Allen's example:


var c =  // where the single character between the quotes is the 
Unicode character U+1f638


c.length == 2;
c === "\ud83d\ude38"; //the two character UTF-16 encoding of 0x1f638
c.charCodeAt(0) == 0xd83d;
c.charCodeAt(1) == 0xde38;

Still no BRS to set, we need one if we want a full-Unicode outcome 
(c.length == 1, etc.).


/be


RE: New full Unicode for ES6 idea

2012-02-21 Thread Phillips, Addison
 
 Normalization happens to source upstream of the JS engine. Here I'll call on a
 designated Unicode hitter. ;-)
 

I agree that Unicode Normalization shouldn't happen automagically in the JS 
engine. I rather doubt that normalization happens to source upstream of the JS 
engine, unless by "upstream" you mean "best see to the normalization yourself."

By contrast, providing a method for normalizing strings would be useful.

Addison


Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich

Phillips, Addison wrote:

Normalization happens to source upstream of the JS engine. Here I'll call on a
designated Unicode hitter. ;-)



I agree that Unicode Normalization shouldn't happen automagically in the JS engine. I rather doubt that 
normalization happens to source upstream of the JS engine, unless by "upstream" you 
mean "best see to the normalization yourself."


Yes ;-).

I meant ECMA-262 punts source normalization upstream in the spec 
pipeline that runs parallel to the browser's loading-the-URL | 
processing-what-was-loaded pipeline. ECMA-262 is concerned only with its 
little slice of processing heaven.



By contrast, providing a method for normalizing strings would be useful.

/summon Norbert.

/be


RE: New full Unicode for ES6 idea

2012-02-21 Thread Phillips, Addison
 
 I meant ECMA-262 punts source normalization upstream in the spec pipeline
 that runs parallel to the browser's loading-the-URL | processing-what-was-
 loaded pipeline. ECMA-262 is concerned only with its little slice of 
 processing
 heaven.

Yep. One of the problems is that the source script may not be using a Unicode 
encoding or may be using a Unicode encoding and be serialized in a 
non-normalized form. Your slice of processing heaven treats 
Unicode-normalization-equivalent-yet-different-codepoint-sequence tokens as 
unequal. Not that this is a bad thing.
 
  By contrast, providing a method for normalizing strings would be useful.
 /summon Norbert.

(hides the breakables, listens for thunder)

Addison



RE: New full Unicode for ES6 idea

2012-02-21 Thread Phillips, Addison
Because it has always been possible, it’s difficult to say how many scripts 
have transported byte-oriented data by “punning” the data into strings. 
Actually, I think this is more likely to be truly binary data rather than text 
in some non-Unicode character encoding, but anything is possible, I suppose. 
This could include using non-character values like “FFFE”, “FFFF” in addition 
to the surrogates. A BRS-running implementation would break a script that 
relied on String being a sequence of 16-bit unsigned integer values with no 
error checking.


One of my examples, GB 18030, is a four-byte encoding and a Chinese government 
standard.  It is a mapping onto Unicode, but this mapping is table-driven 
rather than algorithm driven like the UTF-* transport formats.  To provide a 
single example, Unicode 0x2259 maps onto GB 18030 0x8136D830.
AP: GB 18030 is more complex than that. Not all characters are four-byte, for 
example. As a multibyte encoding, you might choose to “pun” GB 18030 into a 
String as 81 36 d8 30. There isn’t much attraction to punning it into 0x8136 
0xd830, but, as noted above, someone might be foolish enough to try it ;-). 
Scripts that rely on this probably break under BRS.

You're right about Big5 being byte-oriented, maybe this was a bad example, 
although it is a double-byte charset. It works by putting ASCII down low making 
bytes above 0x7f escapes into code pages dereferenced by the next byte.  Each 
code point is encoded with one or two bytes, never more.  If I were developing 
with Big5 in JS, I would store the byte stream 4a 4b d8 00 c1 c2 4c as  004a 
004b d800 c1c2 004c.  This would allow me to use JS regular expressions and so 
on.
Not exactly. The trailing bytes in Big5 start at 0x40, for example. But it is 
certainly the case that some multibyte characters in Big5 happen to have the 
same byte-pair as a surrogate code point (when considered as a pair of bytes) 
or other non-character in the Unicode BMP, and one might (he says, squinting 
really hard) want to do as you suggest and record the multibyte sequence as a 
single code point.
But the data does not need to arrive from C API -- it could easily be delivered 
by an XHR request where, say, the remote end dumps database rows into a 
transport format based around evaluating JS string literals (like JSON).
Allowing isolated invalid sequences isn’t actually the problem, if you think 
about it. Yes, the data is bad and yes you can’t view it cleanly. But you can 
do whatever you need to on it.
The problem is when you intend to store two values that end up as a single 
character. If I have a string with code points “f235 5e7a e040 d800”, the d800 
does no particular harm. The problem is: if I construct a BRS string using that 
sequence and then concatenate the sequence “dc00 a053 3254” onto it, the 
resulting string is only *six* characters long, rather than the expected seven, 
since presumably the d800 dc00 pair turns into U+10000.
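Code-unit strings keep the halves separate in .length, but ES2015 code-point iteration already exhibits exactly this fusion, which makes the concern easy to demonstrate:

```javascript
// Four code units concatenated with three more: still 7 code units,
// but only 6 code points, because d800 + dc00 reads back as U+10000.
const a = "\uF235\u5E7A\uE040\uD800";
const b = "\uDC00\uA053\u3254";
const s = a + b;
console.log(s.length);                      // 7 code units
console.log([...s].length);                 // 6 code points
console.log(s.codePointAt(3).toString(16)); // "10000"
```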
Addison


Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich

Phillips, Addison wrote:


Because it has always been possible, it’s difficult to say how many 
scripts have transported byte-oriented data by “punning” the data into 
strings. Actually, I think this is more likely to be truly binary data 
rather than text in some non-Unicode character encoding, but anything 
is possible, I suppose. This could include using non-character values 
like “FFFE”, “FFFF” in addition to the surrogates. A BRS-running 
implementation would break a script that relied on String being a 
sequence of 16-bit unsigned integer values with no error checking.




Allen's view of the BRS-enabled semantics would have 16-bit GIGO 
without exceptions -- you'd be storing 16-bit values, whatever their 
source (including \u literals spelling invalid characters and 
unmatched surrogates) in at-least-21-bit elements of strings, and 
reading them back.


My concern and reason for advocating early or late errors on shenanigans 
was that people today writing surrogate pairs literally and then taking 
extra pains in JS or C++ (whatever the host language might be) to 
process them as single code points and characters would be broken by the 
BRS-enabled behavior of separating the parts into distinct code points.


But that's pessimistic. It could happen, but OTOH anyone coding 
surrogate pairs might want them to read back piece-wise when indexing. 
In that case what Allen proposes, storing each formerly 16-bit code 
unit, however expressed, in the wider 21-or-more-bits unit, and reading 
back likewise, would just work.


Sorry if this is all obvious. Mainly I want to throw in my lot with 
Allen's exception-free literal/constructor approach. The encoding APIs 
should throw on invalid Unicode but literals and strings as immutable 
16-bit storage buffers should work as today.


/be


Re: New full Unicode for ES6 idea

2012-02-21 Thread Allen Wirfs-Brock

On Feb 21, 2012, at 7:37 AM, Brendan Eich wrote:

 Brendan Eich wrote:
 in open-source browsers and JS engines that use uint16 vectors internally
 
 Sorry, that reads badly. All I meant is that I can't tell what closed-source 
 engines do, not that they do not comply with ECMA-262 combined with other web 
 standards to have the same observable effect, e.g. Allen's example:

A quick scan of http://code.google.com/p/v8/issues/detail?id=761 suggests that 
there may be more variability among current browsers than we thought.  I 
haven't tried my original test case in Chrome or IE9 but the discussion in this 
bug report suggests that their behavior may currently be different from FF.

 
 var c =  // where the single character between the quotes is the Unicode 
 character U+1f638
 
 c.length == 2;
 c === "\ud83d\ude38"; //the two character UTF-16 encoding of 0x1f638
 c.charCodeAt(0) == 0xd83d;
 c.charCodeAt(1) == 0xde38;
 
 Still no BRS to set, we need one if we want a full-Unicode outcome (c.length 
 == 1, etc.).
 
 /be



Re: New full Unicode for ES6 idea

2012-02-21 Thread Tab Atkins Jr.
On Tue, Feb 21, 2012 at 3:11 PM, Brendan Eich bren...@mozilla.com wrote:
 Hi Mark, thanks for this post.
 Mark Davis ☕ wrote:

 UTF-8 represents a code point as 1-4 8-bit code units

 1-6.
...
 Lock up your encoders, I am so not a Unicode guru but this is what my
 reptile coder brain remembers.

Only theoretically.  UTF-8 has been locked down to the same range that
UTF-16 has (RFC 3629), so the largest real character you'll see is 4
bytes, as that gives you exactly 21 bits of data.
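The cap is easy to confirm with the standard TextEncoder API (available in browsers and as a global in modern Node):

```javascript
// UTF-8 byte counts per code point, capped at 4 bytes by RFC 3629:
// 1 byte through U+007F, 2 through U+07FF, 3 through U+FFFF,
// 4 through U+10FFFF.
const enc = new TextEncoder();
console.log(enc.encode("A").length);          // 1 (U+0041)
console.log(enc.encode("\u00E9").length);     // 2 (U+00E9)
console.log(enc.encode("\u20AC").length);     // 3 (U+20AC)
console.log(enc.encode("\u{10FFFF}").length); // 4 (last code point)
```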

~TJ


RE: New full Unicode for ES6 idea

2012-02-21 Thread Phillips, Addison
 
 Hi Mark, thanks for this post.
 
 Mark Davis ☕ wrote:
  UTF-8 represents a code point as 1-4 8-bit code units
 
 1-6.

No. 1 to *4*. Five and six byte UTF-8 sequences are illegal and invalid. 

 
  UTF-16 represents a code point  as 2 or 4 16-bit code units
 
 1 or 2.

Yes, 1 or 2 16-bit code units (that's 2 or 4 bytes, of course).

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.






Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich
Thanks, all! That's a relief to know; six bytes always seemed too long, 
but my reptile coder brain was also reptile-coder-lazy and I never dug 
into it.


/be

Phillips, Addison wrote:

Hi Mark, thanks for this post.

Mark Davis ☕ wrote:

UTF-8 represents a code point as 1-4 8-bit code units

1-6.


No. 1 to *4*. Five and six byte UTF-8 sequences are illegal and invalid.


UTF-16 represents a code point  as 2 or 4 16-bit code units

1 or 2.


Yes, 1 or 2 16-bit code units (that's 2 or 4 bytes, of course).

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.







Re: New full Unicode for ES6 idea

2012-02-21 Thread Norbert Lindenberg
I'll reply to Brendan's proposal in two parts: first about the goals for 
supplementary character support, second about the BRS.

 Full 21-bit Unicode support means all of:
 
 * indexing by characters, not uint16 storage units;
 * counting length as one greater than the last index; and
 * supporting escapes with (up to) six hexadecimal digits.

For me, full 21-bit Unicode support has a different priority list.

First come the essentials: Regular expressions; functions that interpret 
strings; the overall sense that all Unicode characters are supported.

1) Regular expressions must recognize supplementary characters as atomic 
entities, and interpret them according to Unicode semantics.

Look at the contortions one has to go through currently to describe a simple 
character class that includes supplementary characters:
https://github.com/roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js

Read up on why it has to be done this way, and see to what extremes some people 
are going to make supplementary characters work despite ECMAScript:
http://inimino.org/~inimino/blog/javascript_cset

Now, try to figure out how you'd convert a user-entered string to a regular 
expression such that you can search for the string without case distinction, 
where the string may contain supplementary characters such as жвь (Deseret 
for one).

Regular expressions matter a lot here because, if done properly, they eliminate 
much of the need for iterating over strings manually.
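This requirement eventually shipped in ES2015 as the regular-expression u flag, under which patterns operate on code points rather than code units:

```javascript
// Without the u flag, "." matches one code unit, so a supplementary
// character (two code units) fails a single-character test.
const astral = "\u{1D11E}"; // MUSICAL SYMBOL G CLEF, outside the BMP
console.log(/^.$/.test(astral));          // false: sees two code units
console.log(/^.$/u.test(astral));         // true: sees one code point
console.log(/^\u{1D11E}$/u.test(astral)); // true: code-point escape in a pattern
```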

2) Built-in functions that interpret strings have to recognize supplementary 
characters as atomic entities and interpret them according to their Unicode 
semantics. The list of functions in ES5 that violate this principle is actually 
rather short: Besides the String functions relying on regular expressions 
(match, replace, search, split), they're the String case conversion functions 
(toLowerCase, toLocaleLowerCase, toUpperCase, toLocaleUpperCase) and the 
relational comparison for strings (11.8.5). But the principle is also important 
for new functionality being considered for ES6 and above.

3) It must be clear that the full Unicode character set is allowed and 
supported. This means at least getting rid of the reference to UCS-2 (clause 2) 
and the bizarre equivalence between characters and UTF-16 code units (clause 
6). ECMAScript has already defined several ways to create UTF-16 strings 
containing supplementary characters (parsing UTF-8 source; using Unicode 
escapes for surrogate pairs), and lets applications freely pass around such 
strings. Browsers have surrounded ECMAScript implementations with text input, 
text rendering, DOM APIs, and XMLHTTPRequest with full Unicode support, and 
generally use full UTF-16 to exchange text with their ECMAScript subsystem. 
Developers have used this to build applications that support supplementary 
characters, hacking around the remaining gaps in ECMAScript as seen above. But, 
as in the bug report that Brendan pointed to this morning 
(http://code.google.com/p/v8/issues/detail?id=761), the mention of UCS-2 is 
still used by some to excuse bugs.


Only after these essentials come the niceties of String representation and 
Unicode escapes:

4) 1 String element to 1 Unicode code point is indeed a very nice and desirable 
relationship. Unlike Java, where binary compatibility between virtual machines 
made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be 
compatible only at the source code level - or maybe, with a BRS, not even that.

5) If we don't go for UTF-32, then there should be a few functions to simplify 
access to strings in terms of code points, such as String.fromCodePoint, 
String.prototype.codePointAt.
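Both functions did ship in ES2015; they bridge the code-unit and code-point views without touching indexing or .length:

```javascript
// Code-point access over an unchanged code-unit representation.
const cat = String.fromCodePoint(0x1F638);
console.log(cat.length);                      // 2: still two code units
console.log(cat.codePointAt(0).toString(16)); // "1f638"
console.log(cat === "\uD83D\uDE38");          // true
```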

6) I strongly prefer the use of plain characters over Unicode escapes in source 
code, because plain text is much easier to read than sequences of hex values. 
However, the need for Unicode escapes is greater in the space of supplementary 
characters because here we often have to reference characters for which our 
operating systems don't have glyphs yet. And \u{1D11E} certainly makes it 
easier to cross-reference a character than \uD834\uDD1E. The new escape syntax 
therefore should be on the list, at low priority.
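That escape also shipped in ES2015; the two spellings denote the same string, so the choice is purely one of readability:

```javascript
// The code-point escape and the surrogate-pair escape are equivalent.
console.log("\u{1D11E}" === "\uD834\uDD1E"); // true
console.log("\u{1D11E}".length);             // 2: length still counts code units
```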


I think it would help if other people involved in this discussion also 
clarified what exactly their requirements are for full Unicode support.

Norbert



On Feb 19, 2012, at 0:33 , Brendan Eich wrote:

 Once more unto the breach, dear friends!
 
 ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had 
 pictures of bumblebees on 'em (Gimme five bees for a quarter, you'd say ;-).
 
 Clearly that was a while ago. These days, we would like full 21-bit Unicode 
 character support in JS. Some (mranney at Voxer) contend that it is a 
 requirement.
 
 Full 21-bit Unicode support means all of:
 
 * indexing by characters, not uint16 storage units;
 * counting length as one greater than the last index; and
  * supporting escapes with (up to) six hexadecimal digits.

Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich
On Feb 21, 2012, at 6:05 PM, Norbert Lindenberg 
ecmascr...@norbertlindenberg.com wrote:

 I'll reply to Brendan's proposal in two parts: first about the goals for 
 supplementary character support, second about the BRS.
 
 Full 21-bit Unicode support means all of:
 
 * indexing by characters, not uint16 storage units;
 * counting length as one greater than the last index; and
 * supporting escapes with (up to) six hexadecimal digits.
 
 For me, full 21-bit Unicode support has a different priority list.
 
 First come the essentials: Regular expressions; functions that interpret 
 strings; the overall sense that all Unicode characters are supported.
 
 1) Regular expressions must recognize supplementary characters as atomic 
 entities, and interpret them according to Unicode semantics.

Sorry to have been unclear. In my proposal this follows from the first two 
bullets.


 2) Built-in functions that interpret strings have to recognize supplementary 
 characters as atomic entities and interpret them according to their Unicode 
 semantics.

Ditto.


 3) It must be clear that the full Unicode character set is allowed and 
 supported. 

Absolutely.


 Only after these essentials come the niceties of String representation and 
 Unicode escapes:
 
 4) 1 String element to 1 Unicode code point is indeed a very nice and 
 desirable relationship. Unlike Java, where binary compatibility between 
 virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript 
 needs to be compatible only at the source code level - or maybe, with a BRS, 
 not even that.

Right!


 5) If we don't go for UTF-32, then there should be a few functions to 
 simplify access to strings in terms of code points, such as 
 String.fromCodePoint, String.prototype.codePointAt.

Those would help smooth out different BRS settings, indeed.


 6) I strongly prefer the use of plain characters over Unicode escapes in 
 source code, because plain text is much easier to read than sequences of hex 
 values. However, the need for Unicode escapes is greater in the space of 
 supplementary characters because here we often have to reference characters 
 for which our operating systems don't have glyphs yet. And \u{1D11E} 
 certainly makes it easier to cross-reference a character than \uD834\uDD1E. 
 The new escape syntax therefore should be on the list, at low priority.

Allen and I were just discussing this as a desirable mini- strawman of its own, 
which Allen will write up for consideration at the next meeting.

We will also discuss the BRS . Did you have some thoughts on it?


 I think it would help if other people involved in this discussion also 
 clarified what exactly their requirements are for full Unicode support.

Again, apologies for not being explicit. I model the string methods as 
self-hosted using indexing and .length in straightforward ways. HTH,

/be

 
 Norbert
 
 
 
 On Feb 19, 2012, at 0:33 , Brendan Eich wrote:
 
 Once more unto the breach, dear friends!
 
 ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had 
 pictures of bumblebees on 'em (Gimme five bees for a quarter, you'd say 
 ;-).
 
 Clearly that was a while ago. These days, we would like full 21-bit Unicode 
 character support in JS. Some (mranney at Voxer) contend that it is a 
 requirement.
 
 Full 21-bit Unicode support means all of:
 
 * indexing by characters, not uint16 storage units;
 * counting length as one greater than the last index; and
 * supporting escapes with (up to) six hexadecimal digits.
 
 ES4 saw bold proposals including Lars Hansen's, to allow implementations to 
 change string indexing and length incompatibly, and let Darwin sort it out. 
 I recall that was when we agreed to support \u{XX} as an extension for 
 spelling non-BMP characters.
 
 Allen's strawman from last year, 
 http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings,
  proposed a brute-force change to support full Unicode (albeit with too many 
 hex digits allowed in \u{...}), observing that There are very few places 
 where the ECMAScript specification has actual dependencies upon the size of 
 individual characters so the compatibility impact of supporting full Unicode 
 is quite small. But two problems remained:
 
 P1. As Allen wrote, There is a larger impact on actual implementations, 
 and no implementors that I can recall were satisfied that the cost was 
 acceptable. It might be, we just didn't know, and there are enough signs of 
 high cost to create this concern.
 
 P2. The change is not backward compatible. In JS today, one reads a string s 
 from somewhere and hard-codes, e.g., s.indexOf("\ud800"), to find part of a 
 surrogate pair, then advance to the next-indexed uint16 unit and read the 
 other half, then combine to compute some result. Such usage would break.
 
 Example from Allen:
 
 var c =  // where the single character between the quotes is the Unicode 
 character U+1f638
 
 c.length == 2;
 c === "\ud83d\ude38"; //the two character UTF-16 encoding of 0x1f638 

Re: New full Unicode for ES6 idea

2012-02-21 Thread Norbert Lindenberg
Second part: the BRS.

I'm wondering how development and deployment of existing full-Unicode software 
will play out in the presence of a Big Red Switch. Maybe I'm blind and there 
are ways to simplify the process, but this is how I imagine it.

Let's start with a bit of code that currently supports full Unicode by hacking 
around ECMAScript's limitations:
https://github.com/roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js

To support applications running in a BRS-on environment, Roozbeh would have to 
create a parallel version of the module that (a) takes advantage of regular 
expressions that finally support supplementary characters and (b) uses the new 
Unicode escape syntax instead of the old one. The parallel version has to be 
completely separate because a BRS-on environment would reject the old Unicode 
escapes and an ES5/BRS-off environment would reject the new Unicode escapes.
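
A sketch of the kind of runtime check such dual-version loading would need 
(hypothetical helper; it assumes a BRS-on engine accepts `\u{...}` and a 
BRS-off/ES5 engine rejects it as a syntax error):

```javascript
// Hypothetical feature test: does this environment parse the new
// \u{...} escape syntax? eval is used so that an engine rejecting
// the syntax throws a SyntaxError instead of failing to load us.
function supportsCodePointEscapes() {
  try {
    return eval('"\\u{1F638}"').length >= 1;
  } catch (e) {
    return false;
  }
}
```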

To get the code tested, he also has to create a parallel version of the test 
cases. The parallel version would be functionally identical but set up a BRS-on 
environment and use the new Unicode escape syntax instead of the old one. The 
parallel version has to be completely separate because a BRS-on environment 
would reject the old Unicode escapes and an ES5/BRS-off environment would 
reject the new Unicode escapes. Fortunately the test cases are simple.

Then he has to figure out how the two separate versions of the module will get 
loaded by clients. It's a YUI module, and the YUI loader already has the 
ability to look at several parameters to figure out what to load (minimized vs. 
debug version, localized resource bundles, etc.), so maybe the BRS should be 
another parameter? But the YUI team has a long to-do list, so in the meantime 
the module gets two separate names, and the client has to figure out which one 
to request.

The first client picking up the new version is another, bigger library. As a 
library it doesn't control the BRS, so it has to be able to run with both 
BRS-on and BRS-off. So it has to check the BRS and load the appropriate version 
of the intl-bidi module at runtime. This means, it also has to be tested in 
both environments. Its test cases are not simple. So now it needs modifications 
to the test framework to run the test suite twice, once with BRS-on and once 
with BRS-off.

An application using the library and thus the intl-bidi module decides to take 
the plunge and switch to BRS-on. It doesn't do text processing itself (that's 
what libraries are for), and it doesn't use Unicode escapes, so no code 
changes. But when it throws the switch, exceptions get thrown. It turns out 
that 3 of the 50 JavaScript files loaded during startup use old Unicode 
escapes. One of them seems to do something that might affect supplementary 
characters; for the other two apparently the developers just felt safer 
escaping all non-ASCII characters. The developers of the application don't 
actually know anything about the scripts - they got loaded indirectly by apps, 
ads, and analytics software used by the application. The developers try to find 
out whom they'll have to educate about the BRS to get this resolved.

OK - migrations are hard. But so far most participants have only seen 
additional work, no benefits. How long will this take? When will it end? When 
will browsers make BRS-on the default, let alone eliminate the switch? When can 
Roozbeh abandon his original version? Where's the blue button?

The thing to keep in mind is that most code doesn't need to know anything about 
supplementary characters. The beneficiaries of the switch are only the 
implementors of functions that do need to know, and even they won't really 
benefit until the switch is permanently on (at least for all their clients). It 
seems the switch puts a new burden on many that so far have been rightfully 
oblivious to supplementary characters.

Norbert



On Feb 19, 2012, at 0:33 , Brendan Eich wrote:

[snip]


Re: New full Unicode for ES6 idea

2012-02-20 Thread Wes Garland
On 20 February 2012 00:45, Allen Wirfs-Brock al...@wirfs-brock.com wrote:


 2) Allow invalid unicode characters in strings, and preserve them over
 concatenation – ("\uD800" + "\uDC00").length == 2.



 I think 2) is the only reasonable alternative.


I think so, too -- especially as any sequence of Unicode code points --
including invalid and reserved code points -- constitutes a valid Unicode
string, according to my recollection of the Unicode specification.

In addition to the reasons you listed, it should also be noted that
- 2) is cheaper to implement
- 2) keeps more old code working; ignoring the examples where developers
use String as uint16[], there are also the cases where developers scan
strings for 0xD800. 0xD800 is a reserved code point.

I don't think 1) would be a very good choice, if for no other reason the
 set of valid unicode characters is a moving target that you wouldn't want
 to hardwire into either the ES specification or implementations.


To play the devil's advocate, I could point out that the spec language
could say something about reserved code points.  Those code points are
reserved because, IIRC, they are not representable in UTF-16; they include
the ranges for the surrogate pairs.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102
___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


Re: New full Unicode for ES6 idea

2012-02-20 Thread Wes Garland
On 19 February 2012 16:34, Brendan Eich bren...@mozilla.com wrote:

 Wes Garland wrote:

 Is there a proposal for interaction with JSON?


 From http://www.ietf.org/rfc/rfc4627, 2.5


*snip* - so the proposal is to keep encoding JSON in UTF-16.  What happens
if the BRS is set to Unicode and we want to encode the string
"\uD834\uDD1E" -- the Unicode string which contains two reserved code
points? We do not want to deserialize this as U+1D11E.

I think we should consider that BRS-on should mean six-character escapes in
JSON for non-BMP characters.  It might even be possible to add matching
support for JSON.parse() when BRS-off.  The one caveat is that might make
JSON interchange fragile between BRS-on systems and ES5 engines.
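
For reference, this is how RFC 4627 JSON spells a non-BMP character today — 
a twelve-character surrogate-pair escape — which is the behavior a six-digit 
escape would replace (sketch, reusing U+1D11E from above):

```javascript
// RFC 4627 represents U+1D11E as the escaped surrogate pair
// \uD834\uDD1E; JSON.parse yields one character in two code units.
var clef = JSON.parse('"\\uD834\\uDD1E"');
// clef.length is 2 under 16-bit indexing; a BRS-on engine reading
// the same text by code points would report length 1.
```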

Yes, sharing the uint16 vector is good. But string methods would have to
 index and .length differently (if I can verb .length ;-).


.lengthing is easy; cost is about the same as strlen() and can be cached.
Indexed access is something I have thought about from the implementor's POV
for a while [but not heavily].  I haven't come up with a ground-breaking
technique, I keep coming up with something that looks like a lookup table
for surrogate pairs, degrading to an extra uint32[] when there are many of
them. Anyhow, implementation detail.
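
A sketch of the one-pass ".lengthing" described above, counting code points 
over a UTF-16 backing store (helper name assumed; as noted, the result could 
be cached per string):

```javascript
// Count code points the way strlen() counts bytes: one linear pass,
// treating each well-formed surrogate pair as a single element.
function codePointLength(s) {
  var n = 0;
  for (var i = 0; i < s.length; i++) {
    var hi = s.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
      var lo = s.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) i++; // skip the trail unit
    }
    n++;
  }
  return n;
}
```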


 Of course, strings with the same characters are == and ===. Strings appear
 to be values. If you think of them as immutable reference types there's
 still an obligation to compare characters for strings because computed
 strings are not intern'ed.


What about strings with the same sequence of code units but different code
points? They would have identical backing stores if the backing store were
either UTF-8 or uint32. This can happen if we have BRS-on Strings which
contain non-BMP code points. (Actually, does BRS-on mean that we have to
abandon UTF-16 to store Unicode strings containing invalid code points?
Mark Davis, are you reading?)

How about strings which are considered equal by Unicode but which do not
share the same representation? Will Unicode normalization be performed when
Strings are created/parsed? On comparison? If on compare, would we skip
normalization for ===?

I assume normalizing to NFC form, similar to what W3C does, is the target?

http://www.macchiato.com/unicode/nfc-faq  (Mark Davis)
http://unicode.org/faq/normalization.html
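
A sketch of what NFC-insensitive comparison could look like (this uses 
`String.prototype.normalize`, an API that shipped later in ES6 — it is not 
part of the proposal under discussion):

```javascript
// Compare two strings after normalizing both to NFC, so that a
// precomposed character equals its base + combining mark spelling.
function nfcEquals(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}
// "\u00E9" (precomposed é) vs "e\u0301" (e + combining acute):
// they differ under ===, but compare equal after NFC normalization.
```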

Wes



Re: New full Unicode for ES6 idea

2012-02-20 Thread Wes Garland
On 20 February 2012 09:56, Andrew Oakley and...@ado.is-a-geek.net wrote:


 While this is being discussed, for any new string handling I think we
 should make any invalid strings (according to the rules in Unicode)
 cause some kind of exception on creation.


Can you clarify which definition in the Unicode standard you are proposing
for invalid string?

 Most content actually only tries to access characters of a string like
this:

 for (var i = 0; i < str.length; i++) {
   str[i];
 }

Does anybody have any data on this?  I'm genuinely curious about how much
code on the web does any kind of character access on strings; the only
common use-case that comes to mind (other than wanting uint16[]) is users
who are doing UTF-16 on top of UCS-2.

Wes



Re: New full Unicode for ES6 idea

2012-02-20 Thread Brendan Eich

Allen Wirfs-Brock wrote:

Last year we dispensed with the binary data hacking in strings use-case. I don't see the 
hardship. But rather than throw exceptions on concatenation I would simply eliminate the 
ability to spell code units with \u escapes. Who's with me?


I think we need to be careful not to equate the syntax of ES string literals 
with the actual encoding space of string elements.


I agree, which is why I'm saying with the BRS set, we should forbid 
\u since that is not a code point rather a code unit.



   Whether you say \ud800 or \u{00d800}, or call a function that does 
full-unicode to UTF-16 encoding, or simply create a string from file contents you may end up with 
string elements containing upper or lower half surrogates.


I don't agree in the case of \u{00d800}. That's simply an illegal code 
point, not a code unit (upper or lower half). We can reject it statically.



 Eliminating the \u syntax really doesn't change anything regarding 
actual string processing.


True, but not my point!


What it might do, however, is eliminate the ambiguity about the 
intended meaning of "\uD800\uDc00" in legacy code.


And arising from concatenations, avoiding the loss of Gavin's 
distributive .length property.



If full unicode string mode only supported \u{} escapes then existing code 
that uses \u would have to be updated before it could be used in that mode.  That 
might be a good thing.


My point! ;-)

/be


Re: New full Unicode for ES6 idea

2012-02-20 Thread Brendan Eich

Gavin Barraclough wrote:


What it might do, however, is eliminate the ambiguity about the 
intended meaning of "\uD800\uDc00" in legacy code.  If full unicode 
string mode only supported \u{} escapes then existing code that uses 
\u would have to be updated before it could be used in that mode. 
 That might be a good thing.


Ah, this is a good point.  I was going to ask whether it would be 
inconsistent to deprecate \u but not \xXX, since both could just 
be considered shorthand for \u{...}, but this is a good practical 
reason why it matters more for \u (and I can imagine there may be 
complaints if we take \xXX away!).


Yes. \xXX is innocuous, since ISO 8859-1 is a proper subset of Unicode 
and can't be used to forge surrogate pair halves.



So, just to clarify,
var s1 = "\u{0d800}\u{0dc00}";
var s2 = String.fromCharCode(0xd800) + String.fromCharCode(0xdc00);
s1.length === 2; // true
s2.length === 2; // true
s1 === s2; // true
Does this sound like the expected behavior?


Rather, I'd statically reject the invalid code points.


Also, what would happen to String.fromCharCode?


BRS makes 21-bit chars, so just as String.prototype.charCodeAt returns a 
code point, String.fromCharCode takes actual code point arguments.


Again I'd reject (dynamically in the case of String.fromCharCode) any in 
[0xd800, 0xdfff]. Other code points that are not characters I'd let 
through to future-proof, but not these reserved ones. Also any > 0x10FFFF.
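
A sketch of the dynamic checks just described for a BRS-on 
String.fromCharCode (hypothetical standalone function; the actual proposal 
would put this behavior on the built-in itself):

```javascript
// Reject the surrogate range [0xD800, 0xDFFF] and anything above
// 0x10FFFF; let other non-character code points through.
function brsFromCharCode(cp) {
  if (cp > 0x10FFFF) throw new RangeError("code point above 0x10FFFF");
  if (cp >= 0xD800 && cp <= 0xDFFF)
    throw new RangeError("reserved surrogate code point");
  // String.fromCodePoint (later ES6) gives the intended result here.
  return String.fromCodePoint(cp);
}
```

Note that the surrogate rejection is the added behavior: plain 
String.fromCodePoint accepts lone surrogates without complaint.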


1) Leave this unchanged, it would continue to truncate the input with 
ToUint16?


No, that violates the BRS intent.

2) Change its behavior to allow any code point (maybe switch to 
ToUint32, or ToInteger, and throw a RangeError for input > 0x10FFFF?).


The last.

3) Make it sensitive to the state of the corresponding global object's 
BRS.


In any event, yes: this. The BRS is a switch, you can think of it as 
swapping in the other String implementation, or as a flag tested within 
one shared String implementation whose methods use if statements (which 
could be messy but would work).


We should specify carefully the identity or lack of identity of 
myGlobal.String and otherGlobalWithPossiblyDifferentBRSSetting.String, 
etc. Consider this one-line .html file:


<iframe src="javascript:alert(parent.String === String)"/>

I get false from Chrome, Firefox and Safari, as expected. So the BRS 
could swap in another String, or simply mutate hidden state associated 
with the global in question (as mentioned in my previous post, globals 
keep track of the original values of their built-ins' prototypes, so 
implementations could put the BRS in String or String.prototype too, and 
use random logic instead of separate objects).


If we were to leave it unchanged, using ToUInt16, then I guess we 
would need a new String.fromCodePoint function, to be able to 
create strings for non-BMP characters?


This goes against the BRS design and falls down the Java slippery slope. 
We want one set of standard methods, extended from 16- to 21-bit chars, 
er, code points.


I guess my preference here would be to go with option 3 – tie the 
potentially breaking change to the BRS, but no need for new interface.


Definitely! That's perhaps unclear in my o.p. but I made a to-do out of 
rejecting Java and keeping the duplicate methods or hidden if statements 
under the implementation hood (bonnet for you ;-).


/be


Re: New full Unicode for ES6 idea

2012-02-20 Thread Brendan Eich

Andrew Oakley wrote:

Issues only arise in code that tries to treat a string as an array of
16-bit integers, and I don't think we should be particularly bothered by
performance of code which misuses strings in this fashion (but clearly
this should still work without opt-in to new string handling).


This is all strings in JS and the DOM, today.

That is, we do not have any measure of code that treats strings as 
uint16s, forges strings using \u, etc. but the ES and DOM specs 
have allowed this for > 14 years. Based on bitter experience, it's 
likely that if we change by fiat to 21-bit code points from 16-bit code 
units, some code on the Web will break.


And as noted in the o.p. and in the thread based on Allen's proposal 
last year, browser implementations definitely count on representation 
via array of 16-bit integers, with length property or method counting same.


Breaking the Web is off the table. Breaking implementations, less so. 
I'm not sure why you bring up UTF-8. It's good for encoding and decoding 
but for JS, unlike C, we want string to be a high level full Unicode 
abstraction. Not bytes with bits optionally set indicating more bytes 
follow to spell code points.



I think this is a nicer and more flexible model than string
representations being dependent on which heap they came from - all
issues related to encoding can be contained in the String object
implementation.


You're ignoring the compatibility break here. Browser vendors can't 
afford to do that.



While this is being discussed, for any new string handling I think we
should make any invalid strings (according to the rules in Unicode)
cause some kind of exception on creation.

This is future-hostile if done for all code points. If done only for the 
code points in [D800,DFFF] both for literals using \u{...} and for 
constructive methods such as String.fromCharCode, then I agree.


/be


Re: New full Unicode for ES6 idea

2012-02-20 Thread Brendan Eich

Allen Wirfs-Brock wrote:

For the moment, I'll simply take Wes' word for the above, as it logically makes 
sense.  For some uses, you want to process all possible code points (for 
example, when validating data from an external source).  At this lowest level 
you don't want to impose higher level Unicode semantic constraints:

if (stringFromElseWhere.indexOf("\u{d800}")) 


Sorry, I disagree. We have a chance to keep Strings consistent with 
full Unicode, or broken into uint16 pieces. There is no 
self-consistent third way that has 21-bit code points but allows one to 
jam what up until now have been both code points and code units into 
code points, where they will be misinterpreted.


If someone wants to do data hacking, Binary Data (Typed Arrays) are 
there (even in IE10pp).



 Eliminating the \u syntax really doesn't change anything regarding 
actual string processing.

True, but not my point!


but else where you said you would reject String.fromCharCode(0xd800)


I'm being consistent (I hope!). I'd reject \u altogether with the 
BRS set. It's ambiguous at best, or (I argue, and you argue some of the 
time) it means code units, not code points. We're doing points now, no 
units, with the BRS set, so it has to go.


Same goes for constructive APIs taking (with the BRS set) code points. I 
see nothing but mischief arising from allowing [D800-DFFF]. Unicode 
gurus should school us if there's a use-case that can be sanely composed 
with full Unicode and code points, not units iteration.



so it sounds to me like you are trying to actually ban the occurrence of 0xd800 
as the value of a string element.


Under the BRS set to full Unicode, as a code point, yes.


What it might do, however, is eliminate the ambiguity about the intended meaning of 
"\uD800\uDc00" in legacy code.

And arising from concatenations, avoiding the loss of Gavin's distributive 
.length property.


These aren't the same thing.

\ud8000\udc00 is a specific syntactic construct where there must have 
been a specific user intent in writing it.


(One too many 0s there.)

We do not want to guess. All I know is that \ud800\udc00 means what it 
means today in ECMA-262 and conforming implementations. With the BRS set 
to full Unicode, it could be taken to mean two code points, but that 
results in invalid Unicode and is not backward compatible. It could be 
read as one code point but that is what \u{...} is for and we want 
anyone migrating such hardcoded code into the BRS to check and choose.



  Our legacy problem is that the intent becomes ambiguous when that same 
sequence might be interpreted under different BRS settings.


I propose to solve that by forbiding \u when the BRS is set.


str1 + str2 is much less specific and all we know at runtime (assuming 
either str1 or str2 are strings) is that the user wants to concatenate them.   
The values might be:
str1= String.fromCharCode(0xd800);
str2=String.fromCharCode(0xdc00);

and the user might be intentionally constructing a string containing an 
explicit UTF-16 encoding that is going to be passed off to an external agent 
that specifically requires UTF-16.


Nope, cuz I'm proposing String.fromCharCode calls such as those throw.

We should not be making more type-confusion hazards just to play a 
guessing game that might (but probably won't) preserve some edge-case 
hardcoded surrogate hacking that exists in code on the Web or behind a 
firewall today. Such code can do what it has always done, unless and 
until its maintainer throws the BRS. At that point early and runtime 
errors will provoke rewrite to \u{...}, and with fromCharCode etc., 
21-bit code points that are not reserved for surrogates.



Another way to express what I see as the problem with what you are proposing 
about imposing such string semantics:

Could the revised ECMAScript be used to implement a language that had similar 
but not identical semantic rules to those you are suggesting for ES strings.  My 
sense is that if we went down the path you are suggesting, such an 
implementation would have to use binary data arrays for all of its internal 
string processing and could not use ES string functions to process them.


If you mean a metacircular evaluator, I don't think so. Can you show a 
counterexample?


If you mean a UTF-transcoder, then yes: binary data / typed arrays are 
required. That's the right answer.


/be


Re: New full Unicode for ES6 idea

2012-02-20 Thread Brendan Eich

Allen Wirfs-Brock wrote:
I really don't think any Unicode semantics should be build into the 
basic string representation.  We need to decide on a max element size 
and Unicode motivates 21 bits, but it could be 32-bits.  Personally, 
I've lived through enough address space exhaustion episodes in my 
career to be skeptical of small values like 2^21 being good enough for 
the long term.


This does not seem justified to me as a future-proofing step. Instead, 
it invites my corollary to Postel's Law:


"If you are liberal in what you accept, others will utterly fail to be 
conservative in what they send."


to bite us, hard.

We do not want implementations today to accept non-Unicode code points 
under the BRS (also [D800-DFFF], IMHO). If tomorrow or on April 5, 2063 
when Vulcans arrive to make first contact, we need 32 bits, we can be 
liberal then. Old implementations will choke on Vulcan, Klingon, etc., 
but so they should! They cannot do better, and simply need to be upgraded.


OTOH if we are too liberal now, people will stuff non-Unicode code 
points into strings and it will be up to a receiving peer on the 
Internet to make it right (or wrong). Receiver-makes-it-wrong failed in 
the 80s RPC wars.


Postel's law is not about allowing unknown new bits to flow into 
containers. It is about unexpected combinations at higher message and 
header/field levels. Note that the IP protocol had to pick 4-byte 
addresses, and IPv6 could not be foreseen or usefully future-proofed by 
using wider fields without specific rules governing the use of the extra 
bytes.


/be


Re: New full Unicode for ES6 idea

2012-02-20 Thread Allen Wirfs-Brock

On Feb 20, 2012, at 10:52 AM, Brendan Eich wrote:

 Allen Wirfs-Brock wrote:
 ...
 Another way to express what I see as the problem with what you are proposing 
 about imposing such string semantics:
 
 Could the revised ECMAScript be used to implement a language that had 
 similar but not identical semantic rules to those you are suggesting for ES 
 strings.  My sense is that if we went down the path you are suggesting, such 
 an implementation would have to use binary data arrays for all of its 
 internal string processing and could not use ES string functions to process 
 them.
 
 If you mean a metacircular evaluator, I don't think so. Can you show a 
 counterexample?
 
 If you mean a UTF-transcoder, then yes: binary data / typed arrays are 
 required. That's the right answer.

Not necessarily, metacircular...it could be support for any language that 
imposes different semantic rules on string elements.

You are essentially saying that a compiler targeting ES for a language X  that 
includes a string data type that does not conform to your rules (for example, 
by allowing occurrences of surrogate code points within string data) could not 
use ES strings as the target representation of its string data type.  It also 
could not use the built-in ES string functions in the implementation of 
language X's built-in functions.  It could not leverage any optimizations that 
a ES engine may apply to strings and string functions.  Also, values of X's 
string type can not be directly passed in foreign calls to ES functions. Etc.

Allen


Re: New full Unicode for ES6 idea

2012-02-20 Thread Gavin Barraclough
On Feb 20, 2012, at 8:37 AM, Brendan Eich wrote:
 BRS makes 21-bit chars, so just as String.prototype.charCodeAt returns a code 
 point, String.fromCharCode takes actual code point arguments.
 
 Again I'd reject (dynamically in the case of String.fromCharCode) any in 
 [0xd800, 0xdfff]. Other code points that are not characters I'd let through 
 to future-proof, but not these reserved ones. Also any > 0x10FFFF.

Okay, gotcha – so to clarify, once the BRS is thrown, it should be impossible 
to create a string in which any individual element is an unassigned code point 
(e.g. an unpaired UTF-16 surrogate) – all strings elements should be valid 
unicode characters, right? (or maybe a slightly weaker form of this, all string 
elements must be code points in the ranges 0...0xD7FF or 0xE000...0x10FFFF?).

 Implementations that use uint16 vectors as the character data representation 
 type for both UCS-2 and UTF-16 string variants would probably want 
 another flag bit per string header indicating whether, for the UTF-16 case, 
 the string indeed contained any non-BMP characters. If not, no proxy/copy 
 needed.

If I understand your original proposal, you propose that UCS-2 strings coming 
from other sources be proxied to be iterated by unicode characters (e.g. if the 
DOM returns a string containing the code units "\uD800\uDC00" then JS code 
executing in a context with the BRS set will see this as having length 1, 
right?)  If so, do you propose any special handling for access to unassigned 
unicode code points in UCS-2 strings returned from the DOM (or accessed from 
another global object, where the BRS is not set).

e.g.
var ucs2d800 = foo(); // get a string containing \uD800 from the DOM, 
or another global object in BRS=off mode;
var ucs2dc00 = bar(); // get a string containing \uDC00 from the DOM, 
or another global object in BRS=off mode;
var a = ucs2d800[0];
var b = ucs2d800.charCodeAt(0);
var c = ucs2d800 + ucs2dc00;
var c0 = c.charCodeAt(0);
var c1 = c.charCodeAt(1);

If the proxy is to behave as if the UCS-2 string has been converted to a valid 
unicode string, then I'm guessing that conversion should have converted the 
unmatched surrogates in the UCS-2 into unicode replacement characters? – if so, 
the length of c in the above example would be 2, and the values c0 & c1 would 
be 0xFFFD?
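
A sketch of the conversion being asked about — replacing unpaired surrogate 
halves with U+FFFD while keeping well-formed pairs (helper name invented; 
whether the proxy would actually do this is exactly the open question here):

```javascript
// Walk the uint16 units; keep well-formed surrogate pairs, and
// substitute U+FFFD for any lone lead or lone trail surrogate.
function replaceLoneSurrogates(s) {
  var out = "";
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      var d = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (d >= 0xDC00 && d <= 0xDFFF) {
        out += s.charAt(i) + s.charAt(i + 1); // well-formed pair: keep
        i++;
      } else {
        out += "\uFFFD"; // lone lead surrogate
      }
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      out += "\uFFFD"; // lone trail surrogate
    } else {
      out += s.charAt(i);
    }
  }
  return out;
}
```

If each DOM string were converted this way before the concatenation in the 
example above, c would indeed be two U+FFFD units — matching the guess that 
c0 and c1 would both be 0xFFFD.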

cheers,
G.



Re: New full Unicode for ES6 idea

2012-02-20 Thread Brendan Eich

Allen Wirfs-Brock wrote:


On Feb 20, 2012, at 10:52 AM, Brendan Eich wrote:


Allen Wirfs-Brock wrote:
...
Another way to express what I see as the problem with what you are 
proposing about imposing such string semantics:


Could the revised ECMAScript be used to implement a language that 
had similar but not identical semantic rules to those you are 
suggested for ES strings.  My sense is that if we went down the path 
you are suggesting, such a implementation would have to use binary 
data arrays for all of its internal string processing and could not 
use ES string functions to process them.


If you mean a metacircular evaluator, I don't think so. Can you show 
a counterexample?


If you mean a UTF-transcoder, then yes: binary data / typed arrays 
are required. That's the right answer.


Not necessarily, metacircular...it could be support for any language 
that imposes different semantic rules on string elements.


In that case, binary data / typed arrays, definitely.

You are essentially saying that a compiler targeting ES for a language 
X  that includes a string data type that does not confirm to your 
rules (for example, by allowing occurrences of surrogate code points 
within string data)
First, as a point of order: yes, JS strings as full Unicode does not 
want stray surrogate pair-halves. Does anyone disagree?


Second, binary data / typed arrays stand ready for any such 
not-full-Unicode use-cases.


could not use ES strings as the target representation of its string 
data type.  It also could not use the built-in ES string functions in 
the implementation of language X's built-in functions.


Not if this hypothetical source language being compiled to JS wants 
other than full Unicode, no.


Why is this a problem, even hypothetically? Such a use-case has binary 
data and typed arrays standing ready, and if it really could use 
String.prototype.* methods I would be greatly surprised.


 It could not leverage any optimizations that a ES engine may apply to 
strings and string functions.


Emscripten already compiles LLVM source languages (C, C++, and 
Objective-C at least) to JS and does a very good job (getting better day 
by day). The utility of string function today (including uint16 indexing 
and length) is immaterial. Typed arrays are quite important, though.


Also, values of X's string type can not be directly passed in foreign 
calls to ES functions. Etc.


Emscripten does have a runtime that maps browser functionailty exposed 
to JS to the guest language. It does not AFAIK need to encode surrogate 
pairs in JS strings by hand, let alone make pair-halves.


/be


Re: New full Unicode for ES6 idea

2012-02-20 Thread Allen Wirfs-Brock

On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:

 Allen Wirfs-Brock wrote:
 
 ...
 
 You are essentially saying that a compiler targeting ES for a language X  
 that includes a string data type that does not conform to your rules (for 
 example, by allowing occurrences of surrogate code points within string data)
 First, as a point of order: yes, JS strings as full Unicode does not want 
 stray surrogate pair-halves. Does anyone disagree?

Well, I'm disagreeing.  Do you know of any other language that has imposed 
these sorts of semantic restrictions on runtime string data?  

 
 Second, binary data / typed arrays stand ready for any such not-full-Unicode 
 use-cases.

But lacks the same level of utility function support, not the least of which is 
RegExp

 could not use ES strings as the target representation of its string data 
 type.  It also could not use the built-in ES string functions in the 
 implementation of language X's built-in functions.
 
 Not if this hypothetical source language being compiled to JS wants other 
 than full Unicode, no.
 
 Why is this a problem, even hypothetically? Such a use-case has binary data 
 and typed arrays standing ready, and if it really could use 
 String.prototype.* methods I would be greatly surprised.

My sense is that there are a fairly large variety of string data types that 
could use the existing ES5 string type as a target type and for which many of 
the String.prototype.* methods would function just fine.  The reason is that 
most of the ES5 methods don't impose this sort of semantic restriction on 
string elements.

 
 It could not leverage any optimizations that a ES engine may apply to 
 strings and string functions.
 
 Emscripten already compiles LLVM source languages (C, C++, and Objective-C at 
 least) to JS and does a very good job (getting better day by day). The 
 utility of string functions today (including uint16 indexing and length) is 
 immaterial. Typed arrays are quite important, though.

There are a lot of reasons why ES strings are not a good backing representation 
for C/C++ strings (to the extent that there even is a C string data type). But 
there are also lots of high-level languages that do not have those sorts of 
mapping issues.

If typed arrays are going to be the new string type (maybe better stated as 
arrays of chars) for people doing systems programming in JS, then we should 
probably start thinking about a broader set of utility functions/methods that 
support them.

 
 Also, values of X's string type can not be directly passed in foreign calls 
 to ES functions. Etc.
 
 Emscripten does have a runtime that maps browser functionality exposed to JS 
 to the guest language. It does not AFAIK need to encode surrogate pairs in JS 
 strings by hand, let alone make pair-halves.

But with the BRS flipped it would have to censor C strings passed to JS to 
ensure that no unmatched surrogate halves are present.

Probably not such a big deal, because it isn't using JS strings as its 
representation; but, as hypothesized above, that wouldn't necessarily be the 
case for other languages.
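The censoring Allen describes amounts to a well-formedness scan over incoming UTF-16 data. A minimal sketch (my illustration, not code from the thread):

```javascript
// Returns true if every surrogate code unit in s is part of a matched
// lead/trail pair -- the check a BRS-on embedding would need to run.
function isWellFormedUTF16(s) {
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {        // lead surrogate
      var d = s.charCodeAt(i + 1);           // NaN at end of string
      if (!(d >= 0xDC00 && d <= 0xDFFF)) return false;
      i++;                                   // skip the matched trail
    } else if (c >= 0xDC00 && c <= 0xDFFF) { // lone trail surrogate
      return false;
    }
  }
  return true;
}
```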

Allen


Re: New full Unicode for ES6 idea

2012-02-20 Thread Wes Garland
On 20 February 2012 16:00, Allen Wirfs-Brock al...@wirfs-brock.com wrote:

 My sense is that there is a fairly large variety of string data types that
 could use the existing ES5 string type as a target type and for which
 many of the String.prototype.* methods would function just fine. The
 reason is that most of the ES5 methods don't impose this sort of semantic
 restriction on string elements.


To pick one out of a hat, it might be nice to be able to use non-Unicode
encodings, like GB 18030 or BIG5, and be able to use regexp methods on them
when the BRS is on. (I'm struggling to find a really real real-world
use-case, though)

Observation -- disallowing otherwise legal Unicode strings because they
contain code points d800-dfff has very concrete implementation benefits:
it's possible to use UTF-16 to represent the String's backing store.
Without this concession, I fear it may not be possible to implement BRS-on
without using a UTF-8 or full-code-point backing store (or some
non-standard invention).

Maybe the answer is to consider (shudder) adding String-like utility
functions to the TypedArrays?  FWIW, CommonJS tried to go down this path
and it turned out to be a lot of work for very little benefit (if any).

But with the BRS flipped it would have to censor C strings passed to JS
 to ensure that no unmatched surrogate halves are present.


Only if the C strings are wide-character strings.  8-bit char strings are
fine, they map right onto Latin-1 in native Unicode as well as the UTF-16
and UCS-2 encodings.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102


Re: New full Unicode for ES6 idea

2012-02-20 Thread Brendan Eich

Allen Wirfs-Brock wrote:

On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:


Allen Wirfs-Brock wrote:

...
You are essentially saying that a compiler targeting ES for a language X that 
includes a string data type that does not conform to your rules (for example, 
by allowing occurrences of surrogate code points within string data)

First, as a point of order: yes, JS strings as full Unicode does not want stray 
surrogate pair-halves. Does anyone disagree?


Well, I'm disagreeing.  Do you know of any other language that has imposed 
these sorts of semantic restrictions on runtime string data?

Sure, Java:


 String

public String(int[] codePoints, int offset, int count)

   Allocates a new String that contains characters from a subarray of
   the Unicode code point array argument. The offset argument is the
   index of the first code point of the subarray and the count argument
   specifies the length of the subarray. The contents of the subarray
   are converted to chars; subsequent modification of the int array
   does not affect the newly created string.

   Parameters:
   codePoints - array that is the source of Unicode code points.
   offset - the initial offset.
   count - the length.
   Throws:
   IllegalArgumentException
   (http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IllegalArgumentException.html)
   - if any invalid Unicode code point is found in codePoints
   IndexOutOfBoundsException
   (http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IndexOutOfBoundsException.html)
   - if the offset and count arguments index characters outside the
   bounds of the codePoints array.
   Since:
   1.5




Second, binary data / typed arrays stand ready for any such not-full-Unicode 
use-cases.


But it lacks the same level of utility function support, not the least of which 
is RegExp.


RegExp is miserable for Unicode, it's true. That doesn't strike me as 
compelling for making full-Unicode strings more bug-prone.


There is a strong case to be made for evolving RegExp to be usable with 
certain typed arrays (byte, uint16 at least). But that's another thread.


We should beef up RegExp Unicode escapes; another 'nother thread.


could not use ES strings as the target representation of its string data type.  
It also could not use the built-in ES string functions in the implementation of 
language X's built-in functions.

Not if this hypothetical source language being compiled to JS wants other than 
full Unicode, no.

Why is this a problem, even hypothetically? Such a use-case has binary data and 
typed arrays standing ready, and if it really could use String.prototype.* 
methods I would be greatly surprised.


My sense is that there is a fairly large variety of string data types that could 
use the existing ES5 string type as a target type and for which many of the 
String.prototype.* methods would function just fine. The reason is that most 
of the ES5 methods don't impose this sort of semantic restriction on string 
elements.


If that's true then we should have enough evidence that I'll happily 
concede the point and the spec will allow \uD800 etc. in BRS-enabled 
literals. I do not see such evidence.



It could not leverage any optimizations that a ES engine may apply to strings 
and string functions.

Emscripten already compiles LLVM source languages (C, C++, and Objective-C at 
least) to JS and does a very good job (getting better day by day). The utility 
of string functions today (including uint16 indexing and length) is immaterial. 
Typed arrays are quite important, though.


There are a lot of reasons why ES strings are not a good backing representation 
for C/C++ strings (to the extent that there even is a C string data type). But 
there are also lots of high-level languages that do not have those sorts of 
mapping issues.


Let's name some:

Java: see above. There may be some legacy need to support invalid 
Unicode but I'm not seeing it right now. Anyone?


Python: 
http://docs.python.org/release/3.0.1/howto/unicode.html#the-string-type 
-- lots here, similar to Ruby 1.9 (see below) but not obviously in need 
of invalid Unicode stored in uint16 vectors handled as JS strings.


Ruby: Ruby supports strings with multiple encodings; the encoding is 
part of the string's metadata. I am not the expert here, but I found


http://www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n

helpful, and the more recent

http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

too. See also these very interesting posts from Sam Ruby in 2007:

http://intertwingly.net/blog/2007/12/28/3-1-2
http://intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated

Ruby raises exceptions when you mix two strings with different encodings 
incorrectly, e.g. by concatenation.


I'm not sure about surrogate validation, but from all this I gather that 
compiling Ruby to JS in full would need 

Re: New full Unicode for ES6 idea

2012-02-20 Thread Norbert Lindenberg
As Brendan's link indicates, JSON is specified by RFC 4627, not by the 
ECMAScript Language Specification. JSON is widely used for data exchange with 
and between systems that have nothing to do with ECMAScript and the proposed 
BRS - see the middle section of
http://www.json.org/

So the only thing that can (and must) be done if and when updating the 
ECMAScript Language Specification for the BRS is to update the JSON section 
(15.12 in ES5) to describe how to map from the existing JSON syntax to the new 
BRS-on String representation. Note that JSON.stringify doesn't create Unicode 
escapes for anything other than control characters (presumably those identified 
in RFC 4627).

Norbert


On Feb 20, 2012, at 4:45 , Wes Garland wrote:

 On 19 February 2012 16:34, Brendan Eich bren...@mozilla.com wrote:
 Wes Garland wrote:
 Is there a proposal for interaction with JSON?
 
 From http://www.ietf.org/rfc/rfc4627, 2.5
 
 *snip* - so the proposal is to keep encoding JSON in UTF-16.  What happens if 
 the BRS is set to Unicode and we want to encode the string \uD834\uDD1E -- 
 the Unicode string which contains two reserved code points? We do not want to 
 deserialize this as U+1D11E.
 
 I think we should consider that BRS-on should mean six-character escapes in 
 JSON for non-BMP characters.  It might even be possible to add matching 
 support for JSON.parse() when BRS-off.  The one caveat is that might make 
 JSON interchange fragile between BRS-on systems and ES5 engines.


Re: New full Unicode for ES6 idea

2012-02-20 Thread Allen Wirfs-Brock

On Feb 20, 2012, at 3:14 PM, Brendan Eich wrote:

 Allen Wirfs-Brock wrote:
 On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:
 
 Allen Wirfs-Brock wrote:
 ...
 You are essentially saying that a compiler targeting ES for a language X 
 that includes a string data type that does not conform to your rules (for 
 example, by allowing occurrences of surrogate code points within string 
 data)
 First, as a point of order: yes, JS strings as full Unicode does not want 
 stray surrogate pair-halves. Does anyone disagree?
 
 Well, I'm disagreeing.  Do you know of any other language that has imposed 
 these sorts of semantic restrictions on runtime string data?
 Sure, Java:
 
 
  String

  public String(int[] codePoints, int offset, int count)

    Allocates a new String that contains characters from a subarray of
    the Unicode code point array argument. The offset argument is the
    index of the first code point of the subarray and the count argument
    specifies the length of the subarray. The contents of the subarray
    are converted to chars; subsequent modification of the int array
    does not affect the newly created string.

    Parameters:
    codePoints - array that is the source of Unicode code points.
    offset - the initial offset.
    count - the length.
    Throws:
    IllegalArgumentException
    (http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IllegalArgumentException.html)
    - if any invalid Unicode code point is found in codePoints
    IndexOutOfBoundsException
    (http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IndexOutOfBoundsException.html)
    - if the offset and count arguments index characters outside the
    bounds of the codePoints array.
    Since:
    1.5
 

 Note that the above says invalid Unicode code point. 0xd800 is a valid 
Unicode code point. It isn't a valid Unicode character. 

See 
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isValidCodePoint(int)
 

Determines whether the specified code point is a valid Unicode code point value 
in the range of 0x0000 to 0x10FFFF inclusive. This method is equivalent to the 
expression:
 codePoint >= 0x0000 && codePoint <= 0x10FFFF
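In JS terms, the quoted Java check is just a range test over the full codespace, surrogates included (editor's sketch):

```javascript
// JS rendering of Java's Character.isValidCodePoint: any integer in the
// Unicode codespace counts as valid, including surrogates D800-DFFF.
function isValidCodePoint(cp) {
  return cp >= 0x0000 && cp <= 0x10FFFF;
}
```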
 
 Why is this a problem, even hypothetically? Such a use-case has binary data 
 and typed arrays standing ready, and if it really could use 
 String.prototype.* methods I would be greatly surprised.
 
 My sense is that there is a fairly large variety of string data types that 
 could use the existing ES5 string type as a target type and for which many of 
 the String.prototype.* methods would function just fine. The reason is that 
 most of the ES5 methods don't impose this sort of semantic restriction on 
 string elements.
 
 If that's true then we should have enough evidence that I'll happily concede 
 the point and the spec will allow \uD800 etc. in BRS-enabled literals. I do 
 not see such evidence.

Note my concern isn't so much about literals as it is about string elements 
created via String.fromCharCode.

The only String.prototype method algorithms that seem to have any Unicode 
dependencies are toLowerCase/toUpperCase and the locale variants of those 
methods, and perhaps localeCompare, trim (which knows the Unicode white space 
character classification), and the regular-expression-based methods if the 
regexp is constructed with literal chars or uses character classes. 

All of concat, slice, splice, substring, indexOf/lastIndexOf, and the 
non-regexp-based replace and split calls are defined in terms of string element 
value comparisons and don't really care about what character set is used.
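That element-wise behavior is easy to see in ES5 today (editor's sketch): lone surrogate halves flow through indexOf and slice like any other uint16 element.

```javascript
// A string holding an unmatched lead surrogate: the element-comparison
// methods neither notice nor care.
var s = "a\uD800b";
var hit = s.indexOf("\uD800"); // found by code-unit comparison, index 1
var cut = s.slice(1, 2);       // slices the lone half out as a string
```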

Wes Garland mentioned the possibility of using non-Unicode character sets such 
as Big5.

 
 It could not leverage any optimizations that a ES engine may apply to 
 strings and string functions.
 Emscripten already compiles LLVM source languages (C, C++, and Objective-C 
 at least) to JS and does a very good job (getting better day by day). The 
 utility of string functions today (including uint16 indexing and length) is 
 immaterial. Typed arrays are quite important, though.
 
 There are a lot of reasons why ES strings are not a good backing 
 representation for C/C++ strings (to the extent that there even is a C 
 string data type). But there are also lots of high-level languages that do 
 not have those sorts of mapping issues.
 
 Let's name some:
 
 Java: see above. There may be some legacy need to support invalid Unicode but 
 I'm not seeing it right now. Anyone?

See above; it allows all Unicode code points. It does not restrict strings to 
well-formed UTF-16 encodings.


 
 Python: 
 http://docs.python.org/release/3.0.1/howto/unicode.html#the-string-type -- 
 lots here, similar to Ruby 1.9 (see below) but not obviously in need of 
 invalid Unicode stored in uint16 vectors handled as JS strings.
 

I don't see any restrictions in that doc about strings containing 
\ud800 and friends. Unless there are, BRS 

Re: New full Unicode for ES6 idea

2012-02-20 Thread Allen Wirfs-Brock

On Feb 20, 2012, at 1:42 PM, Wes Garland wrote:

 On 20 February 2012 16:00, Allen Wirfs-Brock al...@wirfs-brock.com wrote:
 
 ...
 Observation -- disallowing otherwise legal Unicode strings because they 
 contain code points d800-dfff has very concrete implementation benefits: it's 
 possible to use UTF-16 to represent the String's backing store. Without this 
 concession, I fear it may not be possible to implement BRS-on without using a 
 UTF-8 or full-code-point backing store (or some non-standard invention).

(or using multiple representations)
 

Yes, I understand. If it is a requirement (or even a goal) to enable 
implementations to use UTF-16 as the backing store, we should be clearer about 
it being so.


 Maybe the answer is to consider (shudder) adding String-like utility 
 functions to the TypedArrays?  FWIW, CommonJS tried to go down this path and 
 it turned out to be a lot of work for very little benefit (if any). 
 
 But with the BRS flipped it would have to censor C strings passed to JS to 
 ensure that no unmatched surrogate halves are present.
 
 Only if the C strings are wide-character strings.  8-bit char strings are 
 fine, they map right onto Latin-1 in native Unicode as well as the UTF-16 and 
 UCS-2 encodings.

Yes, I was assuming WCHAR strings

Allen


Re: New full Unicode for ES6 idea

2012-02-20 Thread Brendan Eich

Allen Wirfs-Brock wrote:

On Feb 20, 2012, at 3:14 PM, Brendan Eich wrote:
Note that the above says invalid Unicode code point. 0xd800 is a 
valid Unicode code point. It isn't a valid Unicode character.


See 
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isValidCodePoint(int) 
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isValidCodePoint%28int%29 



Determines whether the specified code point is a valid Unicode
code point value in the range of 0x0000 to 0x10FFFF inclusive.
This method is equivalent to the expression:

  codePoint >= 0x0000 && codePoint <= 0x10FFFF



I should have remembered this, from the old days of Java and JS talking 
(LiveConnect). Strike one for me.


If that's true then we should have enough evidence that I'll happily 
concede the point and the spec will allow \uD800 etc. in BRS-enabled 
literals. I do not see such evidence.


Note my concern isn't so much about literals as it is about string 
elements created via String.fromCharCode


The only String.prototype method algorithms that seem to have any Unicode 
dependencies are toLowerCase/toUpperCase and the locale variants of 
those methods, and perhaps localeCompare, trim (which knows the Unicode 
white space character classification), and the regular-expression-based 
methods if the regexp is constructed with literal chars or uses 
character classes.


All of concat, slice, splice, substring, indexOf/lastIndexOf, and the 
non-regexp-based replace and split calls are defined in terms of string 
element value comparisons and don't really care about what character set 
is used.


Wes Garland mentioned the possibility of using non-Unicode character 
sets such as Big5


These are byte-based encodings, no? What is the problem with inflating them by 
zero extension to 16 bits now (or 21 bits in the future)? You can't make 
an invalid Unicode character from a byte value.
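The zero extension Brendan describes is one String.fromCharCode.apply away (editor's sketch; the byte values below are arbitrary, not meaningful Big5 data):

```javascript
// Inflate an array of byte values (0..255) into 16-bit string elements.
// Since every byte is below 0xD800, no surrogate-range element can result.
function inflateBytes(bytes) {
  return String.fromCharCode.apply(null, bytes);
}

var inflated = inflateBytes([0xA4, 0x40]); // two elements, zero-extended
```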


Anyway, Big5 punned into JS strings (via a C or C++ API?) is *not* a 
strong use-case for ignoring invalid characters.


Ball one. :-P

Python: 
http://docs.python.org/release/3.0.1/howto/unicode.html#the-string-type 
-- lots here, similar to Ruby 1.9 (see below) but not obviously in 
need of invalid Unicode stored in uint16 vectors handled as JS strings.




I don't see any restrictions in that doc about strings 
containing \ud800 and friends. Unless there are, BRS-enabled ES 
strings couldn't be used as the representation type for Python strings.


You're right, you can make a literal in Python 3 such as '\ud800' 
without error. Strike two.


Ruby: Ruby supports strings with multiple encodings; the encoding is 
part of the string's metadata. I am not the expert here, but I found


http://www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n

helpful, and the more recent

http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

too. See also these very interesting posts from Sam Ruby in 2007:

http://intertwingly.net/blog/2007/12/28/3-1-2
http://intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated

Ruby raises exceptions when you mix two strings with different 
encodings incorrectly, e.g. by concatenation.


I'm not sure about surrogate validation, but from all this I gather 
that compiling Ruby to JS in full would need Uint8Array, along with 
lots of runtime helpers that do not come for free from JS's 
String.prototype methods, in order to handle all the encodings.

If typed arrays are going to be the new string type (maybe better 
stated as arrays of chars) for people doing systems programming in JS, 
then we should probably start thinking about a broader set of 
utility functions/methods that support them.


But, using current 16-bit JS string semantics, a JS string could still 
be used as the character store for many of these encodings, with the 
metadata stored separately (probably in a RubyString wrapper object), and 
the charset-insensitive JS string methods could be used to implement 
the Ruby semantics.


Did I get a hit off your pitch, then? Because Ruby does at least raise 
exceptions on mixed encoding concatenations.


But I'm about to strike out on the next pitch (language). You're almost 
certainly right that most languages with full Unicode support allow 
the programmer to create invalid strings via literals and constructors. 
It also seems common for charset-encoding APIs to validate and throw on 
invalid characters, which makes sense.


I could live with this, especially for String.fromCharCode.

For \uD800... in a BRS-enabled string literal, it still seems to me 
something is going to go wrong right away. Or rather, something *should* 
(like, early error). But based on precedent, and for the odd usage that 
doesn't go wrong ever (reads back code units, or has native code reading 
them and reassembling uint16 elements), I'll go along here too.


This means Gavin's option

2) Allow invalid unicode characters in strings, and preserve them over 
concatenation – (\uD800 + 

Re: New full Unicode for ES6 idea

2012-02-19 Thread Jussi Kalliokoski
I'm not sure what to think about this, being a big fan of the UTF-8
simplicity. :) But anyhow, I like the idea of opt-in, actually so much that
I started thinking, why not make JS be encoding-agnostic?

What I mean here is that maybe we could have multi-charset Strings in JS?
This would be useful especially on the server-side JS. So, what I'm
suggesting is an extension to the String class, maybe defined as follows
(while being just the first thing from the top of my head).

Let's say we have a loadFile function that takes a filename and reads its
contents to a string it returns.

loadFile('my-utf8-file').charset === 'UTF-8'
String(loadFile('my-file'), 'UTF-16').charset === 'UTF-16'
loadFile('my-file').toString('UTF-9').charset === 'UTF-9'
32..toString(10, 'UTF-8').charset === 'UTF-8'

// And hence, we could add easy sugar to the function as well,
loadFile('my-utf8-file', 'UTF-16').charset === 'UTF-16';

What do you think?

Obviously this creates a lot of problems, but backwards compatibility could
(maybe) be preserved without an opt-in, if the default charset would stay
the same.

Cheers,
Jussi

On Sun, Feb 19, 2012 at 10:33 AM, Brendan Eich bren...@mozilla.com wrote:

 Once more unto the breach, dear friends!

 ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had
 pictures of bumblebees on 'em (Gimme five bees for a quarter, you'd say
 ;-).

 Clearly that was a while ago. These days, we would like full 21-bit
 Unicode character support in JS. Some (mranney at Voxer) contend that it is
 a requirement.

 Full 21-bit Unicode support means all of:

 * indexing by characters, not uint16 storage units;
 * counting length as one greater than the last index; and
 * supporting escapes with (up to) six hexadecimal digits.
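Against today's semantics, the first two bullets can be seen in a couple of lines (editor's illustration; the BRS-on values in the comments are what the proposal would make true, not current behavior):

```javascript
// U+1F638 costs two uint16 storage units under current semantics.
// With the proposed BRS on, s.length would be 1 and s[0] would address
// the whole character.
var s = "\uD83D\uDE38";          // one Unicode character, two code units
var todayLength = s.length;      // 2 under current uint16 indexing
var firstUnit = s.charCodeAt(0); // the lead surrogate, 0xD83D
```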

 ES4 saw bold proposals including Lars Hansen's, to allow implementations
 to change string indexing and length incompatibly, and let Darwin sort it
 out. I recall that was when we agreed to support \u{XX} as an
 extension for spelling non-BMP characters.

 Allen's strawman from last year, 
 http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings,
 proposed a brute-force change to support full Unicode (albeit with too many
 hex digits allowed in \u{...}), observing that There are very few places
 where the ECMAScript specification has actual dependencies upon the size of
 individual characters so the compatibility impact of supporting full
 Unicode is quite small. But two problems remained:

 P1. As Allen wrote, There is a larger impact on actual implementations,
 and no implementors that I can recall were satisfied that the cost was
 acceptable. It might be, we just didn't know, and there are enough signs of
 high cost to create this concern.

 P2. The change is not backward compatible. In JS today, one reads a string 
 s from somewhere and hard-codes, e.g., s.indexOf(0xd800) to find part of a 
 surrogate pair, then advances to the next-indexed uint16 unit and reads the 
 other half, then combines to compute some result. Such usage would break.

 Example from Allen:

 var c = "😸" // where the single character between the quotes is the 
 Unicode character U+1f638 

 c.length == 2;
 c === "\ud83d\ude38"; //the two character UTF-16 encoding of 0x1f638
 c.charCodeAt(0) == 0xd83d;
 c.charCodeAt(1) == 0xde38;
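The hand-rolled decoding that such BRS-off code performs, and that the example above exposes, looks like this (editor's sketch):

```javascript
// The hard-coded pattern P2 describes: read a lead/trail surrogate pair
// by uint16 index and combine them into a code point by hand.
function readAstralAt(s, i) {
  var lead = s.charCodeAt(i);
  var trail = s.charCodeAt(i + 1);
  return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
}

var cp = readAstralAt("\uD83D\uDE38", 0); // combines to U+1F638
```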

 (Allen points out how browsers, node.js, and other environments blindly
 handle UTF-8 or whatever incoming format recoding to UTF-16 upstream of the
 JS engine, so the above actually works without any spec-language in
 ECMA-262 saying it should.)

 So based on a recent twitter/github exchange, gist recorded at 
 https://gist.github.com/1850768, I
 would like to propose a variation on Allen's proposal that resolves both of
 these problems. Here are resolutions in reverse order:

 R2. No incompatible change without opt-in. If you hardcode as in Allen's
 example, don't opt in without changing your index, length, and char/code-at
 assumptions.

 Such opt-in cannot be a pragma since those have lexical scope and affect
 code, not the heap where strings and String.prototype methods live.

 We also wish to avoid exposing a full Unicode representation type and
 duplicated suite of the String static and prototype methods, as Java did.
 (We may well want UTF-N transcoding helpers; we certainly want ByteArray 
 <-> UTF-8 transcoding APIs.)

 True, R2 implies there are two string primitive representations at most,
 or more likely 1.x for some fraction .x. Say, a flag bit in the string
 header to distinguish JS's uint16-based indexing (UCS-2) from
 non-O(1)-indexing UTF-16. Lots of non-observable implementation options
 here.

 Instead of any such *big* new observables, I propose a so-called Big Red
 [opt-in] Switch (BRS) on the side of a unit of VM isolation: specifically
 the global object.

 Why the global object? Because for many VMs, each global has its own 

Re: New full Unicode for ES6 idea

2012-02-19 Thread Axel Rauschmayer
On Feb 19, 2012, at 9:33 , Brendan Eich wrote:

 Instead of any such *big* new observables, I propose a so-called Big Red 
 [opt-in] Switch (BRS) on the side of a unit of VM isolation: specifically 
 the global object.


es-discuss-only idea: could that BRS be made to carry more weight? Could it be 
a switch for all breaking ES.next changes?

-- 
Dr. Axel Rauschmayer
a...@rauschma.de

home: rauschma.de
twitter: twitter.com/rauschma
blog: 2ality.com



Re: New full Unicode for ES6 idea

2012-02-19 Thread Peter van der Zee
Do we know how many scripts actually rely on \u15 to produce a
string length of 3? Might it make more sense to put the new unicode
escape under a different escape? Something like \e for extended
unicode, for example. Or is this acceptable migration tax...

On a side note, if we're going to do this, can we also have aliases
in regex to parse certain unicode categories? For instance, the es
spec defines the Uppercase Letter (Lu), Lowercase Letter (Ll),
Titlecase letter (Lt), Modifier letter (Lm), Other letter (Lo),
Letter number (Nl), Non-spacing mark (Mn), Combining spacing mark
(Mc), Decimal number (Nd) and Connector punctuation (Pc) as
possible identifier parts. But right now I have to go very out of my
way (http://qfox.nl/notes/90) to generate and end up with a 56k script
that's almost pure regex.

This works and performance is amazingly fair, but it'd make more sense
to be able to do \pLt or something, to parse any character in the
Titlecase letter category. As far as I understand, these categories
have to be known and supported anyways so these switches shouldn't
cause too much trouble in that regard, at least.
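The shape Peter asks for can be sketched like this (hypothetical at the time of this thread; property escapes of exactly this flavor were standardized much later, in ES2018, behind the /u flag):

```javascript
// A category alias in the pattern itself, instead of a 56k generated
// character class: match any Titlecase_Letter (Lt).
var titlecase = /\p{Lt}/u;           // ES2018 Unicode property escape
var isLt = titlecase.test("\u01C5"); // U+01C5 Dz-with-caron digraph, category Lt
```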

- peter

On Sun, Feb 19, 2012 at 10:17 AM, Axel Rauschmayer a...@rauschma.de wrote:
 On Feb 19, 2012, at 9:33 , Brendan Eich wrote:

 Instead of any such *big* new observables, I propose a so-called Big Red
 [opt-in] Switch (BRS) on the side of a unit of VM isolation: specifically
 the global object.


 es-discuss-only idea: could that BRS be made to carry more weight? Could it
 be a switch for all breaking ES.next changes?

 --
 Dr. Axel Rauschmayer
 a...@rauschma.de

 home: rauschma.de
 twitter: twitter.com/rauschma
 blog: 2ality.com





Re: New full Unicode for ES6 idea

2012-02-19 Thread Mathias Bynens

 On a side note, if we're going to do this, can we also have aliases
 in regex to parse certain unicode categories? For instance, the es
 spec defines the Uppercase Letter (Lu), Lowercase Letter (Ll),
 Titlecase letter (Lt), Modifier letter (Lm), Other letter (Lo),
 Letter number (Nl), Non-spacing mark (Mn), Combining spacing mark
 (Mc), Decimal number (Nd) and Connector punctuation (Pc) as
 possible identifier parts. But right now I have to go very out of my
 way (http://qfox.nl/notes/90) to generate and end up with a 56k script
 that's almost pure regex.


FWIW, it can be done in “just” 11,335 characters:



Re: New full Unicode for ES6 idea

2012-02-19 Thread Mark S. Miller
On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich bren...@mozilla.com wrote:
[...]

 Why the global object? Because for many VMs, each global has its own heap
 or sub-heap (compartment), and all references outside that heap are to
 local proxies that copy from, or in the case of immutable data, reference
 the remote heap.

[...]

Is this true for same origin iframes? I have always assumed that mixing
heaps between same origin iframes results in unmediated direct
object-to-object access. If these are already mediated, what was the issue
that drove us to that?

-- 
Cheers,
--MarkM


Re: New full Unicode for ES6 idea

2012-02-19 Thread Lasse Reichstein
On Sun, Feb 19, 2012 at 12:12 PM, Mark S. Miller erig...@google.com wrote:
 On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich bren...@mozilla.com wrote:
 [...]

 Why the global object? Because for many VMs, each global has its own heap
 or sub-heap (compartment), and all references outside that heap are to
 local proxies that copy from, or in the case of immutable data, reference
 the remote heap.

 [...]

 Is this true for same origin iframes? I have always assumed that mixing
 heaps between same origin iframes results in unmediated direct
 object-to-object access. If these are already mediated, what was the issue
 that drove us to that?

In V8, same origin contexts (or really, any contexts that might
communicate in any way) live in the same heap. Originally, that meant
anything running in the same process was in the same heap, but with
isolates, there can be more heaps in the same process.
You can still determine the origin of an object, to do any necessary
security checks, but references to foreign objects are always plain
pointers into the same heap.

If I have understood the description correctly, I believe Opera merges
heaps from different frames if they start
communicating, effectively putting them in the same heap.
http://my.opera.com/core/blog/2009/12/22/carakan-revisited

/L


Re: New full Unicode for ES6 idea

2012-02-19 Thread Wes Garland
On 19 February 2012 03:33, Brendan Eich bren...@mozilla.com wrote:

 S1 dates from when Unicode fit in 16 bits, and in those days, nickels had
 pictures of bumblebees on 'em (Gimme five bees for a quarter, you'd say
 ;-).


Say, is that an onion on your belt?


 * indexing by characters, not uint16 storage units;
 * counting length as one greater than the last index; and


These are the two items that IME trip up developers who are either not
careful or not aware of UTF-16 encoding details and don't test with non-BMP
input.  Frankly, JS developers should not have to be aware of character
encodings. Strings should just work.
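
The pitfalls described here are easy to reproduce; a minimal sketch of today's code-unit semantics, using U+1D11E (MUSICAL SYMBOL G CLEF) as the non-BMP example:

```javascript
// A single non-BMP character as JS sees it today: one code point,
// but two UTF-16 code units.
var clef = "\uD834\uDD1E";

// .length counts 16-bit code units, not characters.
console.log(clef.length); // 2, not 1

// Indexing splits the surrogate pair into two unusable halves.
console.log(clef.charCodeAt(0).toString(16)); // "d834" (lead surrogate)
console.log(clef.charCodeAt(1).toString(16)); // "dd1e" (trail surrogate)

// Naive reversal produces an invalid string: the pair comes out backwards.
var reversed = clef.split("").reverse().join("");
console.log(reversed === "\uDD1E\uD834"); // true -- corrupted
```

Code that never tests with non-BMP input will pass every one of these operations without noticing the breakage.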

I think that explicitly making strings Unicode and applying the fix above
would solve a *lot* of problems.  If I had this option, I would go so far
as to throw the BRS in my build processes, hg grep all our source code for
strings like D800 and eliminate all the extra UTF-16 machinations.

Another option might be to make ES.next have full Unicode strings; fix
.length and .charCodeAt etc when we are in ES.next context, leaving them
broken otherwise.  I'm not fond of this option, though: since there would
be no BRS, developers might often find themselves unsure of just what the
heck it is they are working with.

So, I like per-global BRS.

* supporting escapes with (up to) six hexadecimal digits.


This is necessary too; developers should be thinking about code points, not
encoding details.


 P2. The change is not backward compatible. In JS today, one read a string
 s from somewhere and hard-code, e.g., s.indexOf(0xd800) to find part of a
 surrogate pair, then advance to the next-indexed uint16 unit and read the
 other half, then combine to compute some result. Such usage would break.


While that is true in the general case, there are many specific cases where
that would not break. I'm thinking I have an implementation of
UnicodeStrlen around here somewhere which works by subracting the number of
UnicodeStrlen around here somewhere which works by subtracting the number of
0xD800 characters from .length.  In this case, that code would continue to
generate correct length counts because it would never find a 0xD800 in a
valid Unicode string (it's a reserved code point).
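
The implementation referred to above isn't shown; a sketch of the general technique it describes might look like the following (checking the whole lead-surrogate range 0xD800-0xDBFF rather than only 0xD800, so it also handles other non-BMP characters):

```javascript
// Code-point length for a UTF-16-indexed string: subtract one for each
// lead surrogate, since each surrogate pair contributes two code units
// but only one code point.
function unicodeStrlen(s) {
  var len = s.length;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) len--;
  }
  return len;
}

unicodeStrlen("hello");        // 5
unicodeStrlen("\uD834\uDD1E"); // 1 (one G clef, U+1D11E)
```

Under the proposed BRS this function keeps returning correct counts, because a valid full-Unicode string never exposes a bare surrogate code point at any index.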


 We also wish to avoid exposing a full Unicode representation type and
 duplicated suite of the String static and prototype methods, as Java did.
 (We may well want UTF-N transcoding helpers; we certainly want ByteArray
 <-> UTF-8 transcoding APIs.)


These are both good goals, in particular, avoiding a full Unicode type
means reducing bug counts in the long term.

Is there a proposal for interaction with JSON?


 Also because inter-compartment traffic is (we conjecture) infrequent
 enough to tolerate the proxy/copy overhead.


Not to mention that the only thing you'd have to do is to tweak [[get]],
charCodeAt and .length when crossing boundaries; you can keep the same
backing store.

You might not even need to do this if the engine keeps the same backing
store for both kinds of strings.


 This means a script intent on comparing strings from two globals with
 different BRS settings could indeed tell that one discloses non-BMP
 char/codes, e.g. charCodeAt return values >= 0x10000. This is the *small*
 new observable I claim we can live with, because someone opted into it at
 least in one of the related global objects.


Funny question, if I have two strings, both "hello", from two globals with
different BRS settings,  are they ==? How about ===?


 R1. To keep compatibility with DOM APIs, the DOM glue used to mediate
 calls from JS to (typically) C++ would have to proxy or copy any strings
 containing non-BMP characters. Strings with only BMP characters would work
 as today.


Is that true if the full Unicode backing store is 16-bit code units using
UTF-16 encoding?  (Anyway, it's an implementation detail.)

In particular, Node.js can get modern at startup, and perhaps engines such
 as V8 as used in Node could even support compile-time (#ifdef) configury by
 which to support only full Unicode.


Sure, this is analogous to how SpiderMonkey deals with UTF-8 C Strings.
Flip a BRS before creating the runtime. :)

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Jussi Kalliokoski wrote:
I'm not sure what to think about this, being a big fan of the UTF-8 
simplicity. :) 


UTF-8 is great, but it's a transfer format, perfect for C and other such 
systems languages (especially ones that use byte-wide char from old 
days). It is not appropriate for JS, which gives users a One True 
String (sorry for caps) primitive type that has higher-level just 
Unicode semantics. Alas, JS's just Unicode was from '96.


There are lots of transfer formats and character set encodings. 
Implementations could use many, depending on what chars a given string 
uses. E.g. ASCII + UTF-16, UTF-8 only as you suggest, other 
combinations. But this would all be under the hood, and at some cost to 
the engine as well as some potential (space, mostly) savings.


But anyhow, I like the idea of opt-in, actually so much that I started 
thinking, why not make JS be encoding-agnostic?


That is precisely the idea. Setting the BRS to full Unicode gives the 
appearance of 21 bits per character via indexing and length accounting. 
You'd have to spell non-BMP literal escapes via \u{...}, no big deal.



What I mean here is that maybe we could have multi-charset Strings in JS?


Now you're saying something else. Having one agnostic higher-level just 
Unicode string type is one thing. That's JS's design goal, always has 
been. It does not imply adding multiple observable CSEs or UTFs that 
break the just Unicode abstraction.


If you can put a JS string in memory for low-level systems languages 
such as C to view, of course there are abstraction breaks. Engine APIs 
may or may not allow such views for optimizations. This is an issue, for 
sure, when embedding (e.g. V8 in Node). It's not a language design 
issue, though, and I'm focused on observables in the language because 
that is where JS currently fails by livin' in the '90s.


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Axel Rauschmayer wrote:
es-discuss-only idea: could that BRS be made to carry more weight? 
Could it be a switch for all breaking ES.next changes?


What do you have in mind? It had better be important. We *just* had the 
breakthrough championed by dherman for One JavaScript. Why make 
trouble by adding runtime semantic changes unduly?


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Axel Rauschmayer
 es-discuss-only idea: could that BRS be made to carry more weight? Could it 
 be a switch for all breaking ES.next changes?
 
 What do you have in mind? It had better be important. We *just* had the 
 breakthrough championed by dherman for One JavaScript. Why make trouble by 
 adding runtime semantic changes unduly?

Two points:
- IIRC, attributes such as onclick are not yet solved, the ideas proposed 
sounded like a BRS, so maybe the two solutions can be combined.
- If we keep in the back of our minds the possibility of using the BRS as ES6 
opt-in, while going forward with One JavaScript (1JS), both approaches can be 
tested in the wild. I’m mostly sold on 1JS, but a few doubts remain, trying out 
the BRS clean cut solution for Unicode will either allay or confirm those 
doubts.

-- 
Dr. Axel Rauschmayer
a...@rauschma.de

home: rauschma.de
twitter: twitter.com/rauschma
blog: 2ality.com



Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Axel Rauschmayer wrote:
es-discuss-only idea: could that BRS be made to carry more weight? 
Could it be a switch for all breaking ES.next changes?


What do you have in mind? It had better be important. We *just* had 
the breakthrough championed by dherman for One JavaScript. Why make 
trouble by adding runtime semantic changes unduly?


Two points:
- IIRC, attributes such as onclick are not yet solved, the ideas 
proposed sounded like a BRS, so maybe the two solutions can be combined.


One JavaScript means there's nothing to solve for event handlers. 
(Previously, with version opt-in, the solution was Content-Script-Type 
[1], via an HTTP header or meta http-equiv).


Could you state the problem with an example?

Perhaps you mean the issue of 'let' at top level in prior scripts (or 
'const'). I think we're all agreed that such bindings (as well as 
'module' bindings at top level) must be visible in event handlers.


- If we keep in the back of our minds the possibility of using the BRS 
as ES6 opt-in, while going forward with One JavaScript (1JS), both 
approaches can be tested in the wild. I’m mostly sold on 1JS, but a 
few doubts remain,


Namely?

We have to test whether extensions such as const and function-in-block 
can be broken, but Apple and Mozilla (at least) are covering that.


We shouldn't over-hedge or try doing two things less well, instead of 
one thing well.


trying out the BRS clean cut solution for Unicode will either allay 
or confirm those doubts.
Adding a BRS and then starting to hang other hats on it is a design 
mistake. When in doubt, leave it out. This proposal is only for Unicode. 
Of course we can consider other uses if the need truly arises, but we 
should not go looking for them right now.


/be

[1] http://www.w3.org/TR/html4/interact/scripts.html#h-18.2.2.1



Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Brendan Eich wrote:
My R2 resolution is not specific to any engine, but I have hopes it 
can be accepted. It is concrete enough to help overcome 
large-yet-vague doubts about implementation impact (at least IMHO). 
Recall that document.domain setting may have to split a merged 
same-origin window/frame graph, at any time. Again implementation 
solutions vary, but this suggests cross-window mediation can be 
interposed lazily.


Another point: HTML5 WindowProxy (vs. Window, the global object on the 
scope chain) exists to solve navigation-away-from-same-origin security 
problems. Any JS that passes strings from one window to another must 
be using a WindowProxy to reference the other. There's a mediation 
point too.


IOW, whether there's a heap/compartment per global is not critical (but 
if there is, then strings are already copied and could be transcoded as 
to their meta-data distinguishing non-BMP+UTF16-aka-full-Unicode from 
bad-ol'-90s-UCS2). The cross-window mediation via WindowProxy is.


The flag bits I mentioned could be combined: 1 flag bit for both 
non-BMP-chars-in-this-string-*and*-full-Unicode-BRS-setting-in-effect-for-its-global.


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread David Bruant
Le 19/02/2012 09:33, Brendan Eich a écrit :
 (...)

 How is the BRS configured? Again, not via a pragma, and not by
 imperative state update inside the language (mutating hidden BRS state
 at a given program point could leave strings created before mutation
 observably different from those created after, unless the
 implementation in effect scanned the local heap and wrapped or copied
 any non-BMP-char-bearing ones created before).

 The obvious way to express the BRS in HTML is a meta tag in document
 head, but I don't want to get hung up on this point. I do welcome
 expert guidance. Here is another W3C/WHATWG interaction point. For
 this reason I'm cc'ing public-script-coord.
I'm not sure a meta is that obvious of a choice.

A bit of related experience about metas:
At the end of 2011, I started a thread [1] about the meta referrer [2].
One use case for it would be to set the value to never so that you
declare that document-wise, no HTTP referer header should be sent when
downloading a script/stylesheet/image or clicking a link/posting a form.
An intervention by Simon Pieters [3] mentioned speculative parsing and
the fact that resources may be fetched *before* the meta is read, hence
leaking the referer while the intention of the author might have been
that there should be no leak.
Since there seems to be no satisfying HTML-based solution for this, I
suggested [4] that it's when the document is delivered by the server
that the server should express how the document should behave regarding
sending referer headers.
The discussion ended by Adam Barth agreeing [5] and planning to propose
this for CSP 1.1 (That's how I learned about CSP [6]).

Unless I'm missing something, I think the same discussion can be had
about the BRS being declared as a meta. Consider:

<script>
// some code that can observe the difference between BRS mode and non-BRS
</script>
<meta BRS>

Should the browser read all <meta>s before executing any script? Worse:
what if an inline script does document.write('<meta BRS>')?

I think a CSP-like solution should be explored.

As a side note, after some time studying the Content Security Policy
(CSP), I came to realize that it doesn't have to be related security
(though that's what motivated it in the first place) and could be
considered as a Content Delivery Policy, offering space to break
semantics and repair things that would be worth the cost of the opt-in
(like the script execution rules or when the referer header is sent).

Worth exploring for the BRS or certainly a lot of other things.

David

[1]
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-December/034275.html
[2] http://wiki.whatwg.org/index.php?title=Meta_referrer
[3]
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2012-January/034283.html
[4]
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2012-January/034522.html
[5]
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2012-January/034523.html
[6]
https://dvcs.w3.org/hg/content-security-policy/raw-file/tip/csp-specification.dev.html


Re: New full Unicode for ES6 idea

2012-02-19 Thread Mark S. Miller
On Sun, Feb 19, 2012 at 11:49 AM, Brendan Eich bren...@mozilla.com wrote:
[...]

 Not all engines mediate cross-same-origin-window accesses. I hear IE9+
 may, indeed rumor is it remotes to another process sometimes (breaking
 run-to-completion a bit; something we should explore breaking in the future
 for window=vat). SpiderMonkey just recently (not sure if this is in a
 Firefox channel yet) went to compartment per global, for good savings once
 things were refactored to maximize sharing of internal immutables.


Other than the origin truncation issue that I am still confused about, what
other benefits are there to mediating interframe access within the same
origin?



 My R2 resolution is not specific to any engine, but I have hopes it can be
 accepted. It is concrete enough to help overcome large-yet-vague doubts
 about implementation impact (at least IMHO). Recall that document.domain
 setting may have to split a merged same-origin window/frame graph, at any
 time. Again implementation solutions vary, but this suggests cross-window
 mediation can be interposed lazily.


How? By doing a full walk of the object graph and doing surgery on it? This
sounds more painful than imposing mediation up front. But I'm still hoping
that objects in same-origin iframes can communicate directly, without
mediation.



 Another point: HTML5 WindowProxy (vs. Window, the global object on the
 scope chain) exists to solve navigation-away-from-same-origin security
 problems. Any JS that passes strings from one window to another must be
 using a WindowProxy to reference the other. There's a mediation point too.

 /be




-- 
Cheers,
--MarkM


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Wes Garland wrote:

Is there a proposal for interaction with JSON?


From http://www.ietf.org/rfc/rfc4627, 2.5:

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   \uD834\uDD1E.
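
So today's JSON escapes already round-trip non-BMP characters as surrogate pairs; a quick check of the RFC's example, which holds regardless of how the engine stores strings internally:

```javascript
// The RFC 4627 twelve-character escape for U+1D11E parses to the same
// two-code-unit string as the raw surrogate pair.
var clef = JSON.parse('"\\uD834\\uDD1E"');
console.log(clef === "\uD834\uDD1E"); // true
console.log(clef.length);             // 2 (today's code-unit length)
```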



Also because inter-compartment traffic is (we conjecture)
infrequent enough to tolerate the proxy/copy overhead.


Not to mention that the only thing you'd have to do is to tweak 
[[get]], charCodeAt and .length when crossing boundaries; you can keep 
the same backing store.


String methods are not generally self-hosted, so internal C++ vector 
access would need to change depending on the string's flag bit, in this 
implementation approach.


You might not even need to do this if the engine keeps the same 
backing store for both kinds of strings.


Yes, sharing the uint16 vector is good. But string methods would have to 
index and .length differently (if I can verb .length ;-).



This means a script intent on comparing strings from two globals
with different BRS settings could indeed tell that one discloses
non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This
is the *small* new observable I claim we can live with, because
someone opted into it at least in one of the related global objects.


Funny question, if I have two strings, both "hello", from two globals 
with different BRS settings,  are they ==? How about ===?


Of course, strings with the same characters are == and ===. Strings 
appear to be values. If you think of them as immutable reference types 
there's still an obligation to compare characters for strings because 
computed strings are not intern'ed.
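
That value semantics is directly observable today; a quick check:

```javascript
// String equality in JS compares contents, not identity: two strings with
// the same characters are == and === regardless of how they were produced.
var parts = ["hel", "lo"];
var computed = parts.join(""); // built at runtime, not a shared literal
var literal = "hello";
console.log(computed === literal); // true
console.log(computed == literal);  // true
```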



R1. To keep compatibility with DOM APIs, the DOM glue used to
mediate calls from JS to (typically) C++ would have to proxy or
copy any strings containing non-BMP characters. Strings with only
BMP characters would work as today.


Is that true if the full unicode backing store is 16-bit code units 
using UTF-16 encoding?  (Any way, it's an implementation detail)


Yes, because DOMString has intrinsic length and indexing notions and 
these must (pending any coordination with w3c) remain ignorant of the 
BRS and livin' in the '90s (DOM too emerged in the UCS-2 era).


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Mark S. Miller wrote:
On Sun, Feb 19, 2012 at 11:49 AM, Brendan Eich bren...@mozilla.com 
mailto:bren...@mozilla.com wrote:

[...]

Not all engines mediate cross-same-origin-window accesses. I hear
IE9+ may, indeed rumor is it remotes to another process sometimes
(breaking run-to-completion a bit; something we should explore
breaking in the future for window=vat). SpiderMonkey just recently
(not sure if this is in a Firefox channel yet) went to compartment
per global, for good savings once things were refactored to
maximize sharing of internal immutables.


Other than the origin truncation issue that I am still confused about,


Do you mean document.domain setting? That allows code in an origin to 
join its origin's super-domain (but not a dotless top level). See


http://www.w3.org/TR/2009/WD-html5-20090212/browsers.html#dom-document-domain

and

http://www.w3.org/TR/2009/WD-html5-20090212/browsers.html#effective-script-origin

what other benefits are there to mediating interframe access within 
the same origin?


The WindowProxy in HTML5 reflects a de-facto standard developed by 
browser implementors to avoid 
closure-survives-navigation-to-other-origin attacks. See


http://www.w3.org/TR/html5/browsers.html#the-windowproxy-object

Demons from the First Age included attacks that loaded a document 
containing a script defining a closure from evil.org into a subframe, 
then stuck a ref to the closure in the super-frame, then navigated the 
sub-frame to victim.com. Guess whose scope the closure saw, with only 
Window objects and no WindowProxy wrappers for the named (not implicit 
in identifier resolution) window/frame objects?



My R2 resolution is not specific to any engine, but I have hopes
it can be accepted. It is concrete enough to help overcome
large-yet-vague doubts about implementation impact (at least
IMHO). Recall that document.domain setting may have to split a
merged same-origin window/frame graph, at any time. Again
implementation solutions vary, but this suggests cross-window
mediation can be interposed lazily.


How? By doing a full walk of the object graph and doing surgery on it? 
This sounds more painful than imposing mediation up front.


No, by indirection, of course ;-). The details vary among browsers.

But I'm still hoping that objects in same-origin iframes can communicate 
directly, without mediation.


Why? Anyway, it's unsafe, wherefore WindowProxy. No big deal. There's no 
mediation for identifier resolution (i.e., scope chain lookup) and 
indeed JITting VMs optimize the heck out of local global accesses already.


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

David Bruant wrote:

Le 19/02/2012 09:33, Brendan Eich a écrit :

  (...)

  How is the BRS configured? Again, not via a pragma, and not by
  imperative state update inside the language (mutating hidden BRS state
  at a given program point could leave strings created before mutation
  observably different from those created after, unless the
  implementation in effect scanned the local heap and wrapped or copied
  any non-BMP-char-bearing ones created before).

  The obvious way to express the BRS in HTML is a <meta> tag in document
  head, but I don't want to get hung up on this point. I do welcome
  expert guidance. Here is another W3C/WHATWG interaction point. For
  this reason I'm cc'ing public-script-coord.

I'm not sure a <meta> is that obvious of a choice.


Sure, guidance welcome as noted. I probably should have started with an 
HTTP header, but then authors may prefer to set it with meta 
http-equiv... which is verbose:


<meta http-equiv="ECMAScript-Full-Unicode" content="1" />

We can't have

<meta http-equiv="BRS" content="1" />

as BRS is too short and obscure. It's a good joke (should 
s/switch/button/ -- the big red button was the button Elmer Fudd warned 
Daffy Duck never to press in Design for Leaving: 
http://www.youtube.com/watch?v=gms_NKzNLUs). Anyway, whatever the header 
name it will be a pain to type the full meta tag.



Unless I'm missing something, I think the same discussion can be had
about the BRS being declared as a <meta>. Consider:

 <script>
 // some code that can observe the difference between BRS mode and non-BRS
 </script>
 <meta BRS>

Should the browser read all <meta>s before executing any script? Worse:
what if an inline script does document.write('<meta BRS>')?


Since I was thinking of meta http-equiv (possibly with a short-hand), 
your example simply puts the meta out of order. It can't work, so it 
should not work (console warning traffic appropriate).


In mentioning meta I did not mean to exclude better ideas. Obviously a 
multi-window/frame app might want a Really Big Red Switch expressed in 
one place only. Ignoring Web Apps with manifest files, where would that 
place be? Hmm, CSP...




I think a CSP-like solution should be explored.


Good suggestion. I hope others on the lists are up-to-date on CSP.

/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Brendan Eich wrote:
the big red button was the button Elmer Fudd warned Daffy Duck never 
to press in Design for Leaving: 
http://www.youtube.com/watch?v=gms_NKzNLUs


Got Elmer and Daffy reversed there --getting old!

/be


RE: New full Unicode for ES6 idea

2012-02-19 Thread Phillips, Addison
Why would converting the existing UCS-2 support to be UTF-16 not be a good 
idea? There is nothing intrinsically wrong that I can see with that approach 
and it would be the most compatible with existing scripts, with no special 
modes, flags, or interactions. 

Yes, the complexity of supplementary characters (i.e. non-BMP characters) 
represented as surrogate pairs must still be dealt with. It would also expose 
the possibility of invalid strings (with unpaired surrogates). But this would 
not be unlike other programming languages--or even ES as it exists today. The 
purity of a Unicode string would be watered down, but perhaps not fatally. 
The Java language went through this (yeah, I know, I know...) and seems to have 
emerged unscathed. Norbert has a lovely doc here about the choices that led to 
this, which seems useful to consider: [1]. W3C I18N Core WG has a wiki page 
shared with TC39 awhile ago here: [2].

To me, switching to UTF-16 seems like a relatively small, containable, 
non-destructive change to allow supplementary character support. It's not as 
pure as a true code-point-based Unicode string solution. But purity isn't 
everything.

What am I missing?

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG) --- hat is OFF in this message

Internationalization is not a feature.
It is an architecture.

[1] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
[2] http://www.w3.org/International/wiki/JavaScriptInternationalization




 -Original Message-
 From: Brendan Eich [mailto:bren...@mozilla.com]
 Sent: Sunday, February 19, 2012 1:34 PM
 To: Wes Garland
 Cc: es-discuss; public-script-co...@w3.org; mran...@voxer.com
 Subject: Re: New full Unicode for ES6 idea
 
 Wes Garland wrote:
  Is there a proposal for interaction with JSON?
 
  From http://www.ietf.org/rfc/rfc4627, 2.5:
 
 To escape an extended character that is not in the Basic Multilingual
 Plane, the character is represented as a twelve-character sequence,
 encoding the UTF-16 surrogate pair.  So, for example, a string
 containing only the G clef character (U+1D11E) may be represented as
 \uD834\uDD1E.
 
 
  Also because inter-compartment traffic is (we conjecture)
  infrequent enough to tolerate the proxy/copy overhead.
 
 
  Not to mention that the only thing you'd have to do is to tweak
  [[get]], charCodeAt and .length when crossing boundaries; you can keep
  the same backing store.
 
 String methods are not generally self-hosted, so internal C++ vector access
 would need to change depending on the string's flag bit, in this
 implementation approach.
 
  You might not even need to do this if the engine keeps the same
  backing store for both kinds of strings.
 
 Yes, sharing the uint16 vector is good. But string methods would have to
 index and .length differently (if I can verb .length ;-).
 
  This means a script intent on comparing strings from two globals
  with different BRS settings could indeed tell that one discloses
  non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This
  is the *small* new observable I claim we can live with, because
  someone opted into it at least in one of the related global objects.
 
 
  Funny question, if I have two strings, both "hello", from two globals
  with different BRS settings,  are they ==? How about ===?
 
 Of course, strings with the same characters are == and ===. Strings appear to
 be values. If you think of them as immutable reference types there's still an
 obligation to compare characters for strings because computed strings are
 not intern'ed.
 
  R1. To keep compatibility with DOM APIs, the DOM glue used to
  mediate calls from JS to (typically) C++ would have to proxy or
  copy any strings containing non-BMP characters. Strings with only
  BMP characters would work as today.
 
 
  Is that true if the full unicode backing store is 16-bit code units
  using UTF-16 encoding?  (Any way, it's an implementation detail)
 
 Yes, because DOMString has intrinsic length and indexing notions and these
 must (pending any coordination with w3c) remain ignorant of the BRS and
 livin' in the '90s (DOM too emerged in the UCS-2 era).
 
 /be



Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Anne van Kesteren wrote:
On Sun, 19 Feb 2012 21:29:48 +0100, David Bruant bruan...@gmail.com 
wrote:

I think a CSP-like solution should be explored.


FWIW, the feedback on CORS (CSP-like) thus far has been that it's 
quite hard to set up custom headers.


I've heard this for years, can believe it in old-school big-company 
settings, but have a not-to-be-shattered hope that with Node.js etc. it 
is easier for content authors to configure headers. Go on, break my heart!


So for something as commonly used as JavaScript I'm not sure we'd want 
to require that. And although more difficult, if we want meta it can 
be made to work, it's just more complicated than simply defining a 
name and a value. But maybe it should be something simpler, e.g.


<html unicode>

in the top-level browsing context's document.


That's pretty but is it misleading? This is the big-red-switch-for-JS, 
not for the whole doc. In particular what is the Content-Type, with what 
charset parameter, and how does this attribute interact? Perhaps it's 
just misnamed.


What are libraries supposed to do by the way, check the length of  
and adjust code accordingly?


Most JS libraries (I'd love to see counterexamples) do not process 
surrogate pairs at all. They too live in the '90s.


As far as the DOM and Web IDL are concerned, I think we would need two 
definitions for code unit. One that means 16-bit code unit and one 
that means Unicode code unit


I'm not a Unicode expert but I believe the latter is called character.

or some such. Looking at 
http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#characterdata 
the rest should follow quite naturally.


What happens with surrogate code points in these new strings? I think 
we do not want to change that each unit is an integer of some kind and 
can be set to any value. And if that is the case, will it hold values 
greater than U+10FFFF?


JS must keep the \u notation for uint16 storage units, and one can 
create invalid Unicode strings already. This hazard does not go away, we 
keep compatibility, but the BRS adds no new hazards and in practice, if 
well-used, should reduce the incidence of invalid-Unicode-string bugs.


The \u{...} notation is independent and should work whatever the BRS 
setting, IMHO. In UCS-2 (default) setting, \u{...} can make pairs. 
In UTF-16 setting, it makes only characters. And of course in the 
latter case indexing and length count characters.
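
As a concrete footnote on the \u{...} notation: in the ES6 that eventually shipped, the escape was adopted without a BRS, so indexing and .length stayed code-unit-based and code-point access arrived as library support instead:

```javascript
// The \u{...} escape as it later shipped in ES6: it denotes a code point,
// but the string is still stored and indexed as UTF-16 code units.
var clef = "\u{1D11E}";               // U+1D11E MUSICAL SYMBOL G CLEF
console.log(clef === "\uD834\uDD1E"); // true -- same surrogate pair
console.log(clef.length);             // 2 (code units, not characters)

// Code-point-aware access came via new methods and iteration:
console.log(clef.codePointAt(0));     // 119070 (0x1D11E)
console.log([...clef].length);        // 1 (iteration is by code point)
```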


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread David Bruant
Le 19/02/2012 22:57, Anne van Kesteren a écrit :
 On Sun, 19 Feb 2012 21:29:48 +0100, David Bruant bruan...@gmail.com
 wrote:
 I think a CSP-like solution should be explored.

 FWIW, the feedback on CORS (CSP-like) thus far has been that it's
 quite hard to set up custom headers.
Do you have a reference for this feedback? Under which circumstances is it
hard?
One major annoyance I see in HTTP headers is that I have never heard of
a hosting service that lets you choose the HTTP headers your service is
served with, and that's problematic. meta http-equiv is of some help to
provide the feature without having control over the HTTP response, but
in some cases, we want the HTTP header to mean something that is
document-wise and a meta can be too late.


 So for something as commonly used as JavaScript I'm not sure we'd want
 to require that. And although more difficult, if we want meta it can
 be made to work, it's just more complicated than simply defining a
 name and a value. But maybe it should be something simpler, e.g.

   <html unicode>

 in the top-level browsing context's document.
I'm not sure it solves anything, since a script could be the first thing
an HTML renderer comes across, even before a doctype, even before an
html start tag.
My guess would be that the HTML spec defines that this script should be
executed even if the script opening tag is among the first bytes of the
document.

David


Re: New full Unicode for ES6 idea

2012-02-19 Thread Boris Zbarsky

On 2/19/12 3:31 PM, Mark S. Miller wrote:

Other than the origin truncation issue that I am still confused about,
what other benefits are there to mediating interframe access within the
same origin?


In Gecko's case, at least, there are certain benefits to garbage 
collection, memory locality, memory accounting, faster determination of 
an object's effective origin, etc.


The important part being the separate per-frame heaps; the mediation is 
just a consequence.


-Boris


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Phillips, Addison wrote:

Why would converting the existing UCS-2 support to be UTF-16 not be a good idea? There is nothing 
intrinsically wrong that I can see with that approach and it would be the most compatible with 
existing scripts, with no special modes, flags, or interactions.


Allen proposed this, essentially (some confusion surrounded the 
discussion by mixing observable-in-language with 
encoding/format/serialization issues, leading to talk of 32-bit 
characters), last year. As I wrote in the o.p., this led to two 
objections: big implementation hit; incompatible change.


I tackled the second with the BRS and (in detail) mediation across DOM 
window boundaries. This I believe takes the sting out of the first 
(lesser implementation change in light of existing mediation at those 
boundaries).



Yes, the complexity of supplementary characters (i.e. non-BMP characters) 
represented as surrogate pairs must still be dealt with.


I'm not sure what you mean. JS today allows (ignoring invalid pairs) 
such surrogates but they count as two indexes and add two to length, not 
one. That is the first problem to fix (ignoring literal escape-notation 
expressiveness).
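The "count as two indexes and add two to length" behavior Brendan describes is easy to observe in any engine today:

```javascript
// Today's semantics: a supplementary character stored as a surrogate
// pair counts as two indexes and adds two to .length.
var s = "\uD834\uDD1E";                    // U+1D11E as a surrogate pair
console.log(s.length);                     // 2, not 1
console.log(s.charCodeAt(0).toString(16)); // "d834" -- high surrogate
console.log(s.charCodeAt(1).toString(16)); // "dd1e" -- low surrogate
```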



  It would also expose the possibility of invalid strings (with unpaired 
surrogates).


That problem exists today.


  But this would not be unlike other programming languages--or even ES as it 
exists today.


Right! We should do better. As I noted, Node.js heavy hitters (mranney 
of Voxer) testify that they want full Unicode, not what's specified 
today with indexing and length-accounting by uint16 storage units.



  The purity of a Unicode string would be watered down, but perhaps not 
fatally. The Java language went through this (yeah, I know, I know...) and seems to have 
emerged unscathed.


Java's dead on the client. It is used by botnets (bugzilla.mozilla.org 
recently suffered a DDOS from one, the bad guys didn't even bother 
changing the user-agent from the default one for the Java runtime). See 
Brian Krebs' blog.



  Norbert has a lovely doc here about the choices that lead to this, which 
seems useful to consider: [1]. W3C I18N Core WG has a wiki page shared with 
TC39 awhile ago here: [2].

To me, switching to UTF-16 seems like a relatively small, containable, 
non-destructive change to allow supplementary character support.


I still don't know what you mean. How would what you call switching to 
UTF-16 differ from today, where one can inject surrogates into literals 
by transcoding from an HTML document or .js file CSE?


In particular, what do string indexing and .length count, uint16 units 
or characters?


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Allen Wirfs-Brock

On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:

 Anne van Kesteren wrote:
 ...
 
 As far as the DOM and Web IDL are concerned, I think we would need two 
 definitions for code unit. One that means 16-bit code unit and one that 
 means Unicode code unit
 
 I'm not a Unicode expert but I believe the latter is called character.

Me neither, but I believe the correct term is code point which refers to the 
full 21-bit code, while Unicode character is the logical entity corresponding 
to that code point.   That usage of character is different from the current 
usage within ECMAScript, where character is what we call the elements of the 
vector of 16-bit numbers that are used to represent a String value.   You can 
access them as string values of length 1 via [ ] or as numeric values via the 
charCodeAt method.

 
 or some such. Looking at 
 http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#characterdata the 
 rest should follow quite naturally.
 
 What happens with surrogate code points in these new strings? I think we do 
 not want to change that each unit is an integer of some kind and can be set 
 to any value. And if that is the case, will it hold values greater than 
 U+10FFFF?
 
 JS must keep the \u notation for uint16 storage units, and one can 
 create invalid Unicode strings already. This hazard does not go away, we keep 
 compatibility, but the BRS adds no new hazards and in practice, if well-used, 
 should reduce the incidence of invalid-Unicode-string bugs.
 
 The \u{...} notation is independent and should work whatever the BRS 
 setting, IMHO. In UCS-2 (default) setting, \u{...} can make pairs. In 
 UTF-16 setting, it makes only characters. And of course in the latter case 
 indexing and length count characters.

I think your names for the BRS modes are misleading.  What you call UTF-16 
actually manifests itself to the ES programmer as UTF-32, as each index position 
within a string corresponds to an unencoded Unicode code point.  There are no 
visible UTF-16 surrogate pairs, even if the implementation is internally using 
a UTF-16 encoding. 

Similarly, UCS-2 as currently implemented actually manifests itself to the ES 
programmer as UTF-16 because implementations turn non-BMP string literal 
characters into UTF-16 surrogate pairs that visibly occupy two index positions.

Allen


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Allen Wirfs-Brock wrote:

On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:
I'm not a Unicode expert but I believe the latter is called character. 


Me neither, but I believe the correct term is code point which refers to the full 21-bit code while 
Unicode character is the logical entity corresponding to that code point.   That usage of 
character is different from the current usage within ECMAScript where character is what we 
call the elements of the vector of 16-bit numbers that are used to represent a String value.   You can access them as 
string values of length 1 via [ ] or as numeric values via the charCodeAt method.


Thanks. We have a confusing transposition of terms between Unicode and 
ECMA-262, it seems. Should we fix?



JS must keep the \u notation for uint16 storage units, and one can create 
invalid Unicode strings already. This hazard does not go away, we keep compatibility, but 
the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of 
invalid-Unicode-string bugs.

The \u{...} notation is independent and should work whatever the BRS setting, IMHO. In UCS-2 
(default) setting, \u{...} can make pairs. In UTF-16 setting, it makes only characters. And of 
course in the latter case indexing and length count characters.


I think your names for the BRS modes are misleading.


You got me, in fact I used full Unicode for the BRS-thrown setting 
elsewhere.


My implementor's bias is showing, because I expect many engines would 
use UTF-16 internally and have non-O(1) indexing for strings with the 
contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Brendan Eich wrote:

Mark S. Miller wrote:
On Sun, Feb 19, 2012 at 12:33 AM, Brendan Eich bren...@mozilla.com 
mailto:bren...@mozilla.com wrote:

[...]

Why the global object? Because for many VMs, each global has its
own heap or sub-heap (compartment), and all references outside
that heap are to local proxies that copy from, or in the case of
immutable data, reference the remote heap.
[...]

Is this true for same origin iframes? I have always assumed that 
mixing heaps between same origin iframes results in unmediated direct 
object-to-object access. If these are already mediated, what was the 
issue that drove us to that?


Not all engines mediate cross-same-origin-window accesses.


Sorry, I used mediate incorrectly here to mean heap/compartment 
isolation. All engines in browsers that conform to HTML5 must mediate 
cross-frame Window (global object) accesses via WindowProxy, as 
discussed in other followups.


I hear IE9+ may, indeed rumor is it remotes to another process 
sometimes (breaking run-to-completion a bit; something we should 
explore breaking in the future for window=vat).


(Hope that parenthetical aside has you charged up -- we need a fresh 
thread on that topic, though... ;-)


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Allen Wirfs-Brock

On Feb 19, 2012, at 2:44 PM, Brendan Eich wrote:

 Allen Wirfs-Brock wrote:
 On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:
 I'm not a Unicode expert but I believe the latter is called character. 
 
 Me neither, but I believe the correct term is code point which refers to 
 the full 21-bit code while Unicode character is the logical entity 
 corresponding to that code point.   That usage of character is different 
 from the current usage within ECMAScript where character is what we call 
 the elements of the vector of 16-bit numbers that are used to represent a 
 String value.   You can access them as string values of length 1 via [ ] or 
 as numeric values via the charCodeAt method.
 
 Thanks. We have a confusing transposition of terms between Unicode and 
 ECMA-262, it seems. Should we fix?

The ES5.1 spec. is ok because it always uses (as defined in section 6) the term 
Unicode character when it means exactly that, and uses character when 
talking about the elements of String values. It says that both code unit and 
character refer to a 16-bit unsigned value.

Your proposal would change that equivalence. In one sense, the BRS would be a 
switch that controls whether an ES character corresponds to a code unit or a 
code point

 
 JS must keep the \u notation for uint16 storage units, and one can 
 create invalid Unicode strings already. This hazard does not go away, we 
 keep compatibility, but the BRS adds no new hazards and in practice, if 
 well-used, should reduce the incidence of invalid-Unicode-string bugs.
 
 The \u{...} notation is independent and should work whatever the BRS 
 setting, IMHO. In UCS-2 (default) setting, \u{...} can make pairs. In 
 UTF-16 setting, it makes only characters. And of course in the latter 
 case indexing and length count characters.
 
 I think your names for the BRS modes are misleading.
 
 You got me, in fact I used full Unicode for the BRS-thrown setting 
 elsewhere.
 
 My implementor's bias is showing, because I expect many engines would use 
 UTF-16 internally and have non-O(1) indexing for strings with the 
 contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.

A fine implementation, but not observable.  Another implementation approach 
that would preserve O(1) indexing would be to simply have two or three 
different internal string representations with 1, 2, or 4 byte internal 
characters.  (You can automatically pick the needed character size when the 
string is created, because strings are immutable and created with their value.)  
A not-quite O(1) approach would segment strings into substring spans using such 
a representation.   Representation choice probably depends a lot on what you 
think are the most common use cases.  If it is string processing in JS, then a 
fast representation is probably what you want to choose.  If it is just 
passing text that is already UTF-8 or UTF-16 encoded from inputs to outputs, 
then a representation that minimizes transcoding would probably be a higher 
priority.

Allen


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Allen Wirfs-Brock wrote:

On Feb 19, 2012, at 2:44 PM, Brendan Eich wrote:

Thanks. We have a confusing transposition of terms between Unicode and 
ECMA-262, it seems. Should we fix?


The ES5.1 spec. is ok because it always uses (as defined in section 6) the term Unicode character when it 
means exactly that and uses character when talking about the elements of String values. It says that both 
code unit and character refer to a 16-bit unsigned value.


That is still pretty confusing. I hope we can stop abusing character 
by overloading it in ECMA-262 in this way.



Your proposal would change that equivalence. In one sense, the BRS would be a switch that controls whether an 
ES character corresponds to a code unit or a code point


Yes, and we might rather have a different word on that basis too.

How about character element? Element to capture indexing as the 
means of accessing the thing in question.


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Trimming to es-discuss.

Brendan Eich wrote:
How about character element? Element to capture indexing as the 
means of accessing the thing in question.


Or avoid the c-word altogether via string element or string indexed 
property? Latter's too long but you see what I mean.


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Allen Wirfs-Brock

On Feb 19, 2012, at 3:18 PM, Brendan Eich wrote:

 Allen Wirfs-Brock wrote:
 ...
 Your proposal would change that equivalence. In one sense, the BRS would be 
 a switch that controls whether an ES character corresponds to a code unit 
 or a code point
 
 Yes, and we might rather have a different word on that basis too.
 
 How about character element? Element to capture indexing as the means of 
 accessing the thing in question.

I generally try to use element in informal contexts for exactly that reason.  
However, shouldn't it be string element  and we could let character and 
Unicode character mean the same thing?

Allen


Re: New full Unicode for ES6 idea

2012-02-19 Thread Mark S. Miller
On Sun, Feb 19, 2012 at 1:52 PM, Brendan Eich bren...@mozilla.com wrote:
[...]

 How? By doing a full walk of the object graph and doing surgery on it?
 This sounds more painful than imposing mediation up front.


 No, by indirection, of course ;-). The details vary among browsers.


I think we're just having a terminology problem. To me, such indirection is
mediation.

-- 
Cheers,
--MarkM


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Mark S. Miller wrote:
On Sun, Feb 19, 2012 at 1:52 PM, Brendan Eich bren...@mozilla.com 
mailto:bren...@mozilla.com wrote:

[...]

How? By doing a full walk of the object graph and doing
surgery on it? This sounds more painful than imposing
mediation up front.


No, by indirection, of course ;-). The details vary among browsers.


I think we're just having a terminology problem. To me, such 
indirection is mediation.


Definitely I was unclear. The (different) mediation by WindowProxy is 
good because local global (oxymoronic, ugh -- let's say this global) 
accesses are unmediated and can be super-optimized.


The mediation by trust-label indirection I was referring to above is 
for-all-accesses. That is painful.


A compartment per global makes the process of finding the trust-label 
(called a Principal in Gecko) significantly faster.


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Allen Wirfs-Brock wrote:

On Feb 19, 2012, at 3:18 PM, Brendan Eich wrote:


Allen Wirfs-Brock wrote:
...
Your proposal would change that equivalence. In one sense, the BRS 
would be a switch that controls whether an ES character corresponds 
to a code unit or a code point


Yes, and we might rather have a different word on that basis too.

How about character element? Element to capture indexing as the 
means of accessing the thing in question.


I generally try to use element in informal contexts for exactly that 
reason.  However, shouldn't it be string element  and we could let 
character and Unicode character mean the same thing?


Yes, see my es-discuss+you followup -- should have measured twice and 
cut once.


I like this much better than anything overloading character.

To hope to make this sideshow beneficial to all the cc: list, what do 
DOM specs use to talk about uint16 units vs. code points?


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Cameron McCormack

Brendan Eich:
 To hope to make this sideshow beneficial to all the cc: list, what do
 DOM specs use to talk about uint16 units vs. code points?

I say code unit as a shorter way of saying 16 bit unsigned integer 
code unit


  http://dev.w3.org/2006/webapi/WebIDL/#dfn-code-unit

(which DOM4 also links to) and then just code point to refer to 21 bit 
numbers that might correspond to a Unicode character, which you can see 
used in


  http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode


Re: New full Unicode for ES6 idea

2012-02-19 Thread Allen Wirfs-Brock

On Feb 19, 2012, at 1:34 PM, Brendan Eich wrote:

 Wes Garland wrote:
 Is there a proposal for interaction with JSON?
 
 From http://www.ietf.org/rfc/rfc4627, 2.5:
 
   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   \uD834\uDD1E.

I think it is actually more complex than just the above.  2.5 also says:

All Unicode characters may be placed within the quotation marks except for the 
characters that must be escaped: quotation mark, reverse solidus, and the 
control characters (U+0000 through U+001F). (emphasis added)

and 3. says:

JSON text SHALL be encoded in Unicode.  The default encoding is UTF-8. and 
then goes on to talk about how to detect UTF-8, 16, and 32 LE and BE encodings. 
 So all those are legal.

It is presumably up to a JSON parser to decide how non-BMP characters in strings 
are encoded for whatever internal representation it is targeting.  Currently JS 
JSON.parse takes its input from a JavaScript string that is composed of 16-bit 
UCS-2 elements, so there are no unencoded non-BMP characters in the string. 
However, according to the ES5.1 spec, JSON.parse (and JSON.stringify) will 
just pass through any UTF-16 surrogate pairs that are encountered. 

With the BRS, JSON.parse and JSON.stringify could encounter non-BMP characters 
in the JS string being processed, and those also would presumably pass through 
transparently.  The one requirement of RFC 4627 that would be impacted by the 
BRS is the 12-character escape sequences mentioned above.  Currently 
JSON.parse implementations encode those as UTF-16 surrogate pairs in the 
generated strings. If the BRS is flipped, the RFC seems to require that they 
generate a single string element.  Because the JSON.stringify spec does not 
escape anything other than control characters, any non-BMP characters it 
encounters would pass through unencoded.   This implies that JSON.parse input of 
the form \uD834\uDD1E would probably round trip back out via JSON.stringify 
as a JSON string containing the single unencoded G clef character: logically 
equivalent but not identical JSON text.
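The round trip described above can be seen under today's 16-bit semantics: the 12-character escape parses to a surrogate pair, and stringify emits the pair unescaped.

```javascript
// JSON.parse turns \uD834\uDD1E into a surrogate pair; JSON.stringify
// passes the pair back out unescaped -- logically equivalent JSON text,
// not the identical text.
var parsed = JSON.parse('"\\uD834\\uDD1E"');
console.log(parsed.length);            // 2 -- stored as a UTF-16 pair
var out = JSON.stringify(parsed);
console.log(out === '"\uD834\uDD1E"'); // true -- emitted unescaped
```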

Allen




Re: New full Unicode for ES6 idea

2012-02-19 Thread Mark Davis ☕
First, it would be great to get full Unicode support in JS. I know that's
been a problem for us at Google.

Secondly, while I agree with Addison that the approach that Java took is
workable, it does cause problems. Ideally someone would be able to loop (a
very common construct) with:

for (codepoint cp : someString) {
  doSomethingWith(cp);
}

In Java, you have to do:

int cp;
for (int i = 0; i < someString.length(); i += Character.charCount(cp)) {
  cp = someString.codePointAt(i);
  doSomethingWith(cp);
}

There are good reasons for why Java did what it did, basically for
compatibility. But if there is some way that JS can work around those,
that'd be great.
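A sketch of what the desired JS loop could look like, assuming a codePointAt-style accessor like Java's (hypothetical at the time of this thread; ES6 later standardized codePointAt):

```javascript
// Iterate code points, stepping two 16-bit units over non-BMP characters.
function eachCodePoint(s, doSomethingWith) {
  for (var i = 0; i < s.length; ) {
    var cp = s.codePointAt(i);
    doSomethingWith(cp);
    i += cp > 0xFFFF ? 2 : 1;   // non-BMP code points span two units
  }
}

var seen = [];
eachCodePoint("a\uD834\uDD1Eb", function (cp) { seen.push(cp); });
console.log(seen.map(function (cp) { return cp.toString(16); }));
// [ '61', '1d11e', '62' ]
```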

3. There's some confusion about the Unicode terminology. Here's a quick
clarification:

code point: number from 0 to 0x10FFFF

character: a code point that is assigned. Eg, 0x61 represents 'a' and is a
character. 0x378 is a code point, but not (yet) a character.

code unit: an encoding 'chunk'.
UTF-8 represents a code point as 1-4 8-bit code units
UTF-16 represents a code point as 1 or 2 16-bit code units
UTF-32 represents a code point as 1 32-bit code unit.
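The UTF-16 pairing rule behind that table can be computed directly; a minimal sketch (the helper name `toUTF16` is just illustrative): a supplementary code point is encoded as two 16-bit units derived from cp - 0x10000, with the 20 remaining bits split 10/10.

```javascript
// Encode one code point as UTF-16 code units.
function toUTF16(cp) {
  if (cp <= 0xFFFF) return [cp];   // BMP: a single 16-bit code unit
  var v = cp - 0x10000;            // 20 bits to distribute
  return [0xD800 + (v >> 10),      // high surrogate: top 10 bits
          0xDC00 + (v & 0x3FF)];   // low surrogate: bottom 10 bits
}

console.log(toUTF16(0x1D11E).map(function (u) { return u.toString(16); }));
// [ 'd834', 'dd1e' ]
```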

--
Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene — (The best is the enemy of the good)



On Sun, Feb 19, 2012 at 16:00, Cameron McCormack c...@mcc.id.au wrote:

 Brendan Eich:

  To hope to make this sideshow beneficial to all the cc: list, what do
  DOM specs use to talk about uint16 units vs. code points?

 I say code unit as a shorter way of saying 16 bit unsigned integer code
 unit

  
 http://dev.w3.org/2006/webapi/WebIDL/#dfn-code-unit

 (which DOM4 also links to) and then just code point to refer to 21 bit
 numbers that might correspond to a Unicode character, which you can see
 used in

  
 http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode




RE: New full Unicode for ES6 idea

2012-02-19 Thread Phillips, Addison
Mark wrote:

First, it would be great to get full Unicode support in JS. I know that's been 
a problem for us at Google.

AP +1: I think we’ve waited for supplementary character support long enough!

Secondly, while I agree with Addison that the approach that Java took is 
workable, it does cause problems.

AP The tension is between “compatibility” and “ease of use” here, I think. The 
question is whether very many scripts depend on the ‘uint16’ nature of a 
character in ES, use surrogates to effect supplementary character support, or 
are otherwise tied to the existing encoding model and are broken as a result of 
changes. In its ideal form, an ES string would logically be a sequence of 
Unicode characters (code points) and only the internal representation would 
worry about whatever character encoding scheme made the most sense (in many 
cases, this might actually be UTF-16).

AP … but what I think is hard to deal with are different modes of processing 
scripts depending on “fullness of the Unicode inside”. Admittedly, the approach 
I favor is rather conservative and presents a number of challenges, most 
notably in adapting regex or for users who want to work strictly in terms of 
character values.

There are good reasons for why Java did what it did, basically for 
compatibility. But if there is some way that JS can work around those, that'd 
be great.

AP Yes, it would.

~Addison



Re: New full Unicode for ES6 idea

2012-02-19 Thread Gavin Barraclough
On Feb 19, 2012, at 3:13 PM, Allen Wirfs-Brock wrote:
 My implementor's bias is showing, because I expect many engines would use 
 UTF-16 internally and have non-O(1) indexing for strings with the 
 contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.
 
 A fine implementation, but not observable.  Another implementation approach 
 that would preserve O(1) indexing would be to simply have two or three 
 different internal string representations with 1, 2, or 4 byte internal 
 characters.  (You can automatically pick the needed character size when the 
 string is created because string are immutable and created with their value). 
  A not-quite O(1) approach would segment strings into substring spans using 
 such an representation.   Representation choice probably depends a lot on 
 what you think are the most common use cases.  If it is string processing in 
 JS then a fast representations is probably what you want to choose.  If it is 
 just passing text  that is already UTF-8 or UTF-16  encoded from inputs to 
 output then a representation that minimizing transcoding would probably be a 
 higher priority.


One way in which the proposal under discussion seems to differ from the 
previous strawman is in the behavior arising from concatenation of strings 
ending/beginning with a surrogate hi and lo element.
How do we want to handle unpaired UTF-16 surrogates in 
a full-unicode string?  I can see three options:

1) Prohibit values from strings that do not map to valid unicode characters 
(either throw an exception, or replace with the unicode replacement character).
2) Allow invalid unicode characters in strings, and preserve them over 
concatenation – (\uD800 + \uDC00).length == 2.
3) Allow invalid unicode characters in strings, but allow surrogate pairs to 
fuse over concatenation – (\uD800 + \uDC00).length == 1.

It seems desirable for full-unicode strings to logically be a sequence of 
unicode characters, stored and processed in an encoding-agnostic manner.  Option 
3 would seem to violate that, exposing the underlying UTF-16 implementation.  
It also loses a distributive property of .length over concatenation that I 
believe is true in ES5 for strings, in that currently for all strings s1 and s2:
s1.length + s2.length == (s1 + s2).length
However if we allow concatenation to fuse surrogate pairs into a single 
character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.

I guess I wonder if it's worth considering either option 1) or 2) – either 
prohibiting invalid unicode characters in strings, or considering something closer 
to the previous strawman, where string storage is defined to be 32-bit (with a 
BRS that, instead of changing iteration, would change string creation, 
introducing an implicit UTF-16 to UTF-32 conversion).

cheers,
G.



Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Cameron McCormack wrote:

Brendan Eich:
 To hope to make this sideshow beneficial to all the cc: list, what do
 DOM specs use to talk about uint16 units vs. code points?

I say code unit as a shorter way of saying 16 bit unsigned integer 
code unit


  http://dev.w3.org/2006/webapi/WebIDL/#dfn-code-unit

(which DOM4 also links to) and then just code point to refer to 21 
bit numbers that might correspond to a Unicode character, which you 
can see used in


  http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode


Well then, you are one up on ECMA-262, and from Mark Davis's message 
using canonical Unicode terms. We shall strive to align terms.


Here's another q for the DOM folks and others using WebIDL: is extending 
the DOM and other specs to respect the BRS and support full Unicode 
conceivable from where you sit? Desirable? Thanks.


/be



Re: New full Unicode for ES6 idea

2012-02-19 Thread Brendan Eich

Gavin Barraclough wrote:
One way in which the proposal under discussion seems to differ from 
the previous strawman is in the behavior arising from concatenation of 
strings ending/beginning with a surrogate hi and lo element.
How do we want to handle unpaired UTF-16 
surrogates in a full-unicode string?  I can see three options:


1) Prohibit values from strings that do not map to valid unicode 
characters (either throw an exception, or replace with the unicode 
replacement character).
2) Allow invalid unicode characters in strings, and preserve them over 
concatenation – (\uD800 + \uDC00).length == 2.
3) Allow invalid unicode characters in strings, but allow surrogate 
pairs to fuse over concatenation – (\uD800 + \uDC00).length == 1.


It seems desirable for full-unicode strings to logically be a sequence 
of unicode characters, stored and processed in an encoding-agnostic 
manner.  option 3 would seem to violate that, exposing the underlying 
UTF-16 implementation.  It also loses a distributive property of 
.length over concatenation that I believe is true in ES5 for strings, 
in that currently for all strings s1 and s2:

s1.length + s2.length == (s1 + s2).length
However if we allow concatenation to fuse surrogate pairs into a 
single character (e.g. s1 = \uD800, s2 = \uDC00) this will no 
longer be true.


I guess I wonder if it's worth considering either options 1) or 2) – 
either prohibiting invalid unicode characters in strings, or consider 
something closer to the previous strawman, where string storage is 
defined to be 32-bit (with a BRS that instead of changing iteration 
would change string creation, introducing an implicit UTF16-UTF32 
conversion).


Great post. I agree 3 is not good. I was thinking based on today's 
exchanges that the BRS being set to full Unicode *could* mean that 
\u is illegal and you *must* use \u{...} to write Unicode *code 
points* (not code units).


Last year we dispensed with the binary data hacking in strings use-case. 
I don't see the hardship. But rather than throw exceptions on 
concatenation I would simply eliminate the ability to spell code units 
with \u escapes. Who's with me?


/be


Re: New full Unicode for ES6 idea

2012-02-19 Thread Allen Wirfs-Brock

On Feb 19, 2012, at 6:54 PM, Gavin Barraclough wrote:

 On Feb 19, 2012, at 3:13 PM, Allen Wirfs-Brock wrote:
 My implementor's bias is showing, because I expect many engines would use 
 UTF-16 internally and have non-O(1) indexing for strings with the 
 contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.
 
 A fine implementation, but not observable.  Another implementation approach 
 that would preserve O(1) indexing would be to simply have two or three 
 different internal string representations with 1, 2, or 4 byte internal 
 characters.  (You can automatically pick the needed character size when the 
 string is created because strings are immutable and created with their 
 value).  A not-quite O(1) approach would segment strings into substring 
 spans using such a representation.  Representation choice probably depends 
 a lot on what you think are the most common use cases.  If it is string 
 processing in JS then a fast representation is probably what you want to 
 choose.  If it is just passing text that is already UTF-8 or UTF-16 
 encoded from inputs to output then a representation that minimizes 
 transcoding would probably be a higher priority.
 
 
 One way in which the proposal under discussion seems to differ from the 
 previous strawman is in the behavior arising from concatenation of strings 
 ending/beginning with a surrogate hi and lo element.
 How do we want to handle unpaired UTF-16 surrogates in a full-unicode 
 string?  I can see three options:
 
 1) Prohibit values from strings that do not map to valid unicode characters 
 (either throw an exception, or replace with the unicode replacement 
 character).
 2) Allow invalid unicode characters in strings, and preserve them over 
 concatenation – ("\uD800" + "\uDC00").length == 2.
 3) Allow invalid unicode characters in strings, but allow surrogate pairs to 
 fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
 
 It seems desirable for full-unicode strings to logically be a sequence of 
 unicode characters, stored and processed in an encoding-agnostic manner.  
 Option 3 would seem to violate that, exposing the underlying UTF-16 
 implementation.  It also loses a distributive property of .length over 
 concatenation that I believe is true in ES5 for strings, in that currently 
 for all strings s1 and s2:
   s1.length + s2.length == (s1 + s2).length
 However if we allow concatenation to fuse surrogate pairs into a single 
 character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.
 
 I guess I wonder if it's worth considering either option 1) or 2) – either 
 prohibiting invalid unicode characters in strings, or considering something 
 closer to the previous strawman, where string storage is defined to be 32-bit 
 (with a BRS that instead of changing iteration would change string creation, 
 introducing an implicit UTF-16 to UTF-32 conversion).

I think 2) is the only reasonable alternative.

I don't think 1) would be a very good choice, if for no other reason than that 
the set of valid unicode characters is a moving target that you wouldn't want 
to hardwire into either the ES specification or implementations.

More importantly, some applications require processing strings containing 
invalid unicode characters.  In particular, any sort of transcoder between 
character sets requires this.  If you want to take a full unicode string, 
convert it to UTF-16 and then output it, you may generate intermediate 
strings with elements that contain individual high and low surrogate codes.  
If you were transcoding to a non-Unicode character set any value might be 
possible.

I really don't think any Unicode semantics should be built into the basic 
string representation.  We need to decide on a max element size and Unicode 
motivates 21 bits, but it could be 32 bits.  Personally, I've lived through 
enough address space exhaustion episodes in my career to be skeptical of 
small values like 2^21 being good enough for the long term.
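The creation-time representation choice described in the quoted text can be sketched as a scan for the widest element needed (neededWidth is a hypothetical helper; a real engine would do this on the internal buffer at string creation):

```javascript
// Since strings are immutable, the element width (1, 2, or 4 bytes)
// can be chosen once, when the string is created.
function neededWidth(codePoints) {
  var width = 1;
  for (var i = 0; i < codePoints.length; i++) {
    var cp = codePoints[i];
    if (cp > 0xFFFF) return 4; // supplementary plane: 4-byte elements
    if (cp > 0xFF) width = 2;  // beyond Latin-1: at least 2 bytes
  }
  return width;
}

console.log(neededWidth([0x68, 0x69]));   // 1 ("hi")
console.log(neededWidth([0x3B1, 0x3B2])); // 2 (Greek)
console.log(neededWidth([0x1F600]));      // 4 (non-BMP)
```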

Allen


Re: New full Unicode for ES6 idea

2012-02-19 Thread Bill Frantz

On 2/19/12 at 21:45, al...@wirfs-brock.com (Allen Wirfs-Brock) wrote:

I really don't think any Unicode semantics should be built into 
the basic string representation.  We need to decide on a max 
element size and Unicode motivates 21 bits, but it could be 
32 bits.  Personally, I've lived through enough address space 
exhaustion episodes in my career to be skeptical of small values 
like 2^21 being good enough for the long term.


Can we future-proof any limit an implementation may choose by 
saying that all characters whose code point is too large for a 
particular implementation must be replaced by an invalid 
character code point (which fits into the implementation's 
representation size) on input? An implementation which chooses 
21 bits as the size will become obsolete when Unicode characters 
that need 22 bits are defined. However it will still work with 
characters that fit in 21 bits, and will do something rational 
with ones that do not. Users who need characters in the over 21 
bit set will be encouraged to upgrade.
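The rule proposed above might look like the following sketch (MAX_CODE_POINT and sanitize are hypothetical names; 0x1FFFFF is the 21-bit limit mentioned):

```javascript
// Replace code points too large for the implementation's element
// size with U+FFFD (the replacement character) on input.
var MAX_CODE_POINT = 0x1FFFFF; // 21-bit implementation limit

function sanitize(codePoints) {
  return codePoints.map(function (cp) {
    return cp <= MAX_CODE_POINT ? cp : 0xFFFD;
  });
}

console.log(sanitize([0x41, 0x1F600, 0x200000])); // [ 65, 128512, 65533 ]
```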


Cheers - Bill

---
Bill Frantz        | If the site is supported by | Periwinkle
(408)356-8506      | ads, you are the product.   | 16345 Englewood Ave
www.pwpconsult.com |                             | Los Gatos, CA 95032




Re: New full Unicode for ES6 idea

2012-02-19 Thread Allen Wirfs-Brock

On Feb 19, 2012, at 7:52 PM, Brendan Eich wrote:

 Gavin Barraclough wrote:
 One way in which the proposal under discussion seems to differ from the 
 previous strawman is in the behavior arising from concatenation of strings 
 ending/beginning with a surrogate hi and lo element.
 How do we want to handle unpaired UTF-16 surrogates in a full-unicode 
 string?  I can see three options:
 
 1) Prohibit values from strings that do not map to valid unicode characters 
 (either throw an exception, or replace with the unicode replacement 
 character).
 2) Allow invalid unicode characters in strings, and preserve them over 
 concatenation – ("\uD800" + "\uDC00").length == 2.
 3) Allow invalid unicode characters in strings, but allow surrogate pairs to 
 fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
 
 It seems desirable for full-unicode strings to logically be a sequence of 
 unicode characters, stored and processed in an encoding-agnostic manner.  
 Option 3 would seem to violate that, exposing the underlying UTF-16 
 implementation.  It also loses a distributive property of .length over 
 concatenation that I believe is true in ES5 for strings, in that currently 
 for all strings s1 and s2:
 s1.length + s2.length == (s1 + s2).length
 However if we allow concatenation to fuse surrogate pairs into a single 
 character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.
 
 I guess I wonder if it's worth considering either option 1) or 2) – either 
 prohibiting invalid unicode characters in strings, or considering something 
 closer to the previous strawman, where string storage is defined to be 
 32-bit (with a BRS that instead of changing iteration would change string 
 creation, introducing an implicit UTF-16 to UTF-32 conversion).
 
 Great post. I agree 3 is not good. I was thinking based on today's exchanges 
 that the BRS being set to full Unicode *could* mean that \u is 
 illegal and you *must* use \u{...} to write Unicode *code points* (not code 
 units).
 
 Last year we dispensed with the binary data hacking in strings use-case. I 
 don't see the hardship. But rather than throw exceptions on concatenation I 
 would simply eliminate the ability to spell code units with \u escapes. 
 Who's with me?

I think we need to be careful not to equate the syntax of ES string literals 
with the actual encoding space of string elements.  Whether you say \ud800 or 
\u{00d800}, or call a function that does full-unicode to UTF-16 encoding, or 
simply create a string from file contents you may end up with string elements 
containing upper or lower half surrogates.  Eliminating the \u syntax 
really doesn't change anything regarding actual string processing. 

What it might do, however, is eliminate the ambiguity about the intended 
meaning of "\uD800\uDC00" in legacy code.  If full unicode string mode only 
supported \u{} escapes then existing code that uses \u would have to be 
updated before it could be used in that mode.  That might be a good thing.
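The distinction between escape syntax and element sequence is visible in engines that support both escape forms: the two spellings below are different syntax for the same string value.

```javascript
// Code-unit escapes and a code-point escape denoting the same string.
var viaUnits = "\uD800\uDC00"; // two code-unit escapes
var viaPoint = "\u{10000}";    // one code-point escape

console.log(viaUnits === viaPoint); // true
console.log(viaPoint.length);       // 2 -- still two UTF-16 elements
```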

Allen
 
 /be
 



Re: New full Unicode for ES6 idea

2012-02-19 Thread Gavin Barraclough
On Feb 19, 2012, at 10:05 PM, Allen Wirfs-Brock wrote:

 Great post. I agree 3 is not good. I was thinking based on today's exchanges 
 that the BRS being set to full Unicode *could* mean that \u is 
 illegal and you *must* use \u{...} to write Unicode *code points* (not 
 code units).
 
 Last year we dispensed with the binary data hacking in strings use-case. I 
 don't see the hardship. But rather than throw exceptions on concatenation I 
 would simply eliminate the ability to spell code units with \u 
 escapes. Who's with me?
 
 I think we need to be careful not to equate the syntax of ES string literals 
 with the actual encoding space of string elements.  Whether you say \ud800 
 or \u{00d800}, or call a function that does full-unicode to UTF-16 
 encoding, or simply create a string from file contents you may end up with 
 string elements containing upper or lower half surrogates.  Eliminating the 
 \u syntax really doesn't change anything regarding actual string 
 processing. 
 
 What it might do, however, is eliminate the ambiguity about the intended 
 meaning of "\uD800\uDC00" in legacy code.  If full unicode string mode 
 only supported \u{} escapes then existing code that uses \u would have to 
 be updated before it could be used in that mode.  That might be a good thing.

Ah, this is a good point.  I was going to ask whether it would be inconsistent 
to deprecate \u but not \xXX, since both could just be considered shorthand 
for \u{...}, but this is a good practical reason why it matters more for \u 
(and I can imagine there may be complaints if we take \xXX away!).

So, just to clarify,
var s1 = "\u{0d800}\u{0dc00}";
var s2 = String.fromCharCode(0xd800) + String.fromCharCode(0xdc00);
s1.length === 2; // true
s2.length === 2; // true
s1 === s2; // true
Does this sound like the expected behavior?

Also, what would happen to String.fromCharCode?

1) Leave this unchanged, it would continue to truncate the input with ToUint16?
2) Change its behavior to allow any code point (maybe switch to ToUint32, or 
ToInteger, and throw a RangeError for input > 0x10FFFF?).
3) Make it sensitive to the state of the corresponding global object's BRS.

If we were to leave it unchanged, using ToUint16, then I guess we would need a 
new String.fromCodePoint function, to be able to create strings for non-BMP 
characters?  Presumably we would then want a new String.codePointAt function, 
for symmetry?  This would also raise a question of what String.charCodeAt 
should return for code points outside of the Uint16 range – should it return 
the actual value, or ToUint16 of the code point to mirror the truncation 
performed by fromCharCode?
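For what it's worth, engines that later shipped String.fromCodePoint and String.prototype.codePointAt answer these questions roughly along the lines sketched: fromCharCode is left unchanged (option 1, ToUint16 truncation) and a new code-point interface is added alongside it.

```javascript
// fromCharCode still truncates with ToUint16; fromCodePoint accepts
// full code points and throws a RangeError above 0x10FFFF.
console.log(String.fromCharCode(0x10400) === "\u0400");        // true
console.log(String.fromCodePoint(0x10400) === "\uD801\uDC00"); // true
console.log("\u{10400}".codePointAt(0).toString(16));          // "10400"
console.log("\u{10400}".charCodeAt(0).toString(16));           // "d801"

var threw = false;
try {
  String.fromCodePoint(0x110000);
} catch (e) {
  threw = e instanceof RangeError;
}
console.log(threw); // true
```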

I guess my preference here would be to go with option 3 – tie the potentially 
breaking change to the BRS, with no need for a new interface.

cheers,
G.
