Re: New full Unicode for ES6 idea

2012-02-21 Thread Norbert Lindenberg
Second part: the BRS.

I'm wondering how development and deployment of existing full-Unicode software 
will play out in the presence of a Big Red Switch. Maybe I'm blind and there 
are ways to simplify the process, but this is how I imagine it.

Let's start with a bit of code that currently supports full Unicode by hacking 
around ECMAScript's limitations:
https://github.com/roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js

To support applications running in a BRS-on environment, Roozbeh would have to 
create a parallel version of the module that (a) takes advantage of regular 
expressions that finally support supplementary characters and (b) uses the new 
Unicode escape syntax instead of the old one. The parallel version has to be 
completely separate because a BRS-on environment would reject the old Unicode 
escapes and an ES5/BRS-off environment would reject the new Unicode escapes.

To get the code tested, he also has to create a parallel version of the test 
cases. The parallel version would be functionally identical but set up a BRS-on 
environment and use the new Unicode escape syntax instead of the old one. The 
parallel version has to be completely separate because a BRS-on environment 
would reject the old Unicode escapes and an ES5/BRS-off environment would 
reject the new Unicode escapes. Fortunately the test cases are simple.

Then he has to figure out how the two separate versions of the module will get 
loaded by clients. It's a YUI module, and the YUI loader already has the 
ability to look at several parameters to figure out what to load (minimized vs. 
debug version, localized resource bundles, etc.), so maybe the BRS should be 
another parameter? But the YUI team has a long to-do list, so in the meantime 
the module gets two separate names, and the client has to figure out which one 
to request.

The first client picking up the new version is another, bigger library. As a 
library it doesn't control the BRS, so it has to be able to run with both 
BRS-on and BRS-off. So it has to check the BRS and load the appropriate version 
of the intl-bidi module at runtime. This means it also has to be tested in 
both environments. Its test cases are not simple. So now it needs modifications 
to the test framework to run the test suite twice, once with BRS-on and once 
with BRS-off.
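
A rough sketch of the runtime check such a library might perform. The thread does not say how a script would query the BRS, so this probes for the new escape syntax instead; the function names and both module names are hypothetical. The probe source is assembled at runtime so that an engine without the new syntax can still parse the surrounding file.

```javascript
// Hypothetical helper: reports whether this engine parses "\u{...}" escapes.
function supportsCodePointEscapes() {
  try {
    // Only the inner source contains the new escape; the outer file stays
    // parseable by an engine that knows nothing but \uXXXX.
    return new Function('return "\\u{1D11E}".length > 0;')();
  } catch (e) {
    return false; // a SyntaxError means the new escape is not understood
  }
}

// Hypothetical module names, standing in for the "two separate names"
// that the client has to choose between.
function pickBidiModule() {
  return supportsCodePointEscapes()
    ? "gallery-intl-bidi-codepoint"
    : "gallery-intl-bidi";
}

console.log(pickBidiModule());
```

Whatever the real mechanism would have been, both branches of such a check need test coverage, which is the cost the paragraph above describes.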

An application using the library and thus the intl-bidi module decides to take 
the plunge and switch to BRS-on. It doesn't do text processing itself (that's 
what libraries are for), and it doesn't use Unicode escapes, so no code 
changes. But when it throws the switch, exceptions get thrown. It turns out 
that 3 of the 50 JavaScript files loaded during startup use old Unicode 
escapes. One of them seems to do something that might affect supplementary 
characters; for the other two apparently the developers just felt safer 
escaping all non-ASCII characters. The developers of the application don't 
actually know anything about the scripts - they got loaded indirectly by apps, 
ads, and analytics software used by the application. The developers try to find 
out whom they'll have to educate about the BRS to get this resolved.

OK - migrations are hard. But so far most participants have only seen 
additional work, no benefits. How long will this take? When will it end? When 
will browsers make BRS-on the default, let alone eliminate the switch? When can 
Roozbeh abandon his original version? Where's the blue button?

The thing to keep in mind is that most code doesn't need to know anything about 
supplementary characters. The beneficiaries of the switch are only the 
implementors of functions that do need to know, and even they won't really 
benefit until the switch is permanently on (at least for all their clients). It 
seems the switch puts a new burden on many that so far have been rightfully 
oblivious to supplementary characters.

Norbert



On Feb 19, 2012, at 0:33 , Brendan Eich wrote:

[snip]

> Allen's strawman from last year, 
> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings,
>  proposed a brute-force change to support full Unicode (albeit with too many 
> hex digits allowed in "\u{...}"), observing that "There are very few places 
> where the ECMAScript specification has actual dependencies upon the size of 
> individual characters so the compatibility impact of supporting full Unicode 
> is quite small." But two problems remained:
> 
> P1. As Allen wrote, "There is a larger impact on actual implementations", and 
> no implementors that I can recall were satisfied that the cost was 
> acceptable. It might be, we just didn't know, and there are enough signs of 
> high cost to create this concern.
> 
> P2. The change is not backward compatible. In JS today, one can read a string 
> s from somewhere and hard-code, e.g., s.indexOf("\ud800") to find part of a 
> surrogate pair, then advance to the next-indexed uint16 unit and read the 
> other half, then combine to 

Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich
On Feb 21, 2012, at 6:05 PM, Norbert Lindenberg wrote:

> I'll reply to Brendan's proposal in two parts: first about the goals for 
> supplementary character support, second about the BRS.
> 
>> Full 21-bit Unicode support means all of:
>> 
>> * indexing by characters, not uint16 storage units;
>> * counting length as one greater than the last index; and
>> * supporting escapes with (up to) six hexadecimal digits.
> 
> For me, full 21-bit Unicode support has a different priority list.
> 
> First come the essentials: Regular expressions; functions that interpret 
> strings; the overall sense that all Unicode characters are supported.
> 
> 1) Regular expressions must recognize supplementary characters as atomic 
> entities, and interpret them according to Unicode semantics.

Sorry to have been unclear. In my proposal this follows from the first two 
bullets.


> 2) Built-in functions that interpret strings have to recognize supplementary 
> characters as atomic entities and interpret them according to their Unicode 
> semantics.

Ditto.


> 3) It must be clear that the full Unicode character set is allowed and 
> supported. 

Absolutely.


> Only after these essentials come the niceties of String representation and 
> Unicode escapes:
> 
> 4) 1 String element to 1 Unicode code point is indeed a very nice and 
> desirable relationship. Unlike Java, where binary compatibility between 
> virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript 
> needs to be compatible only at the source code level - or maybe, with a BRS, 
> not even that.

Right!


> 5) If we don't go for UTF-32, then there should be a few functions to 
> simplify access to strings in terms of code points, such as 
> String.fromCodePoint, String.prototype.codePointAt.

Those would help smooth out different BRS settings, indeed.


> 6) I strongly prefer the use of plain characters over Unicode escapes in 
> source code, because plain text is much easier to read than sequences of hex 
> values. However, the need for Unicode escapes is greater in the space of 
> supplementary characters because here we often have to reference characters 
> for which our operating systems don't have glyphs yet. And \u{1D11E} 
> certainly makes it easier to cross-reference a character than \uD834\uDD1E. 
> The new escape syntax therefore should be on the list, at low priority.

Allen and I were just discussing this as a desirable mini-strawman of its own, 
which Allen will write up for consideration at the next meeting.

We will also discuss the BRS. Did you have some thoughts on it?


> I think it would help if other people involved in this discussion also 
> clarified what exactly their requirements are for "full Unicode support".

Again, apologies for not being explicit. I model the string methods as 
self-hosted using indexing and .length in straightforward ways. HTH,

/be

> 
> Norbert
> 
> 
> 
> On Feb 19, 2012, at 0:33 , Brendan Eich wrote:
> 
>> Once more unto the breach, dear friends!
>> 
>> ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had 
>> pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say 
>> ;-).
>> 
>> Clearly that was a while ago. These days, we would like full 21-bit Unicode 
>> character support in JS. Some (mranney at Voxer) contend that it is a 
>> requirement.
>> 
>> Full 21-bit Unicode support means all of:
>> 
>> * indexing by characters, not uint16 storage units;
>> * counting length as one greater than the last index; and
>> * supporting escapes with (up to) six hexadecimal digits.
>> 
>> ES4 saw bold proposals including Lars Hansen's, to allow implementations to 
>> change string indexing and length incompatibly, and let Darwin sort it out. 
>> I recall that was when we agreed to support "\u{XX}" as an extension for 
>> spelling non-BMP characters.
>> 
>> Allen's strawman from last year, 
>> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings,
>>  proposed a brute-force change to support full Unicode (albeit with too many 
>> hex digits allowed in "\u{...}"), observing that "There are very few places 
>> where the ECMAScript specification has actual dependencies upon the size of 
>> individual characters so the compatibility impact of supporting full Unicode 
>> is quite small." But two problems remained:
>> 
>> P1. As Allen wrote, "There is a larger impact on actual implementations", 
>> and no implementors that I can recall were satisfied that the cost was 
>> acceptable. It might be, we just didn't know, and there are enough signs of 
>> high cost to create this concern.
>> 
>> P2. The change is not backward compatible. In JS today, one can read a 
>> string s from somewhere and hard-code, e.g., s.indexOf("\ud800") to find 
>> part of a surrogate pair, then advance to the next-indexed uint16 unit and 
>> read the other half, then combine to compute some result. Such usage would 
>> break.
>> 
>> Example from Allen:
>> 
>> var c = "😁" // where

Re: New full Unicode for ES6 idea

2012-02-21 Thread Norbert Lindenberg
I'll reply to Brendan's proposal in two parts: first about the goals for 
supplementary character support, second about the BRS.

> Full 21-bit Unicode support means all of:
> 
> * indexing by characters, not uint16 storage units;
> * counting length as one greater than the last index; and
> * supporting escapes with (up to) six hexadecimal digits.

For me, full 21-bit Unicode support has a different priority list.

First come the essentials: Regular expressions; functions that interpret 
strings; the overall sense that all Unicode characters are supported.

1) Regular expressions must recognize supplementary characters as atomic 
entities, and interpret them according to Unicode semantics.

Look at the contortions one has to go through currently to describe a simple 
character class that includes supplementary characters:
https://github.com/roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js

Read up on why it has to be done this way, and see to what extremes some people 
are going to make supplementary characters work despite ECMAScript:
http://inimino.org/~inimino/blog/javascript_cset

Now, try to figure out how you'd convert a user-entered string to a regular 
expression such that you can search for the string without case distinction, 
where the string may contain supplementary characters such as "𐐶𐐲𐑌" (Deseret 
for "one").

Regular expressions matter a lot here because, if done properly, they eliminate 
much of the need for iterating over strings manually.

2) Built-in functions that interpret strings have to recognize supplementary 
characters as atomic entities and interpret them according to their Unicode 
semantics. The list of functions in ES5 that violate this principle is actually 
rather short: Besides the String functions relying on regular expressions 
(match, replace, search, split), they're the String case conversion functions 
(toLowerCase, toLocaleLowerCase, toUpperCase, toLocaleUpperCase) and the 
relational comparison for strings (11.8.5). But the principle is also important 
for new functionality being considered for ES6 and above.

3) It must be clear that the full Unicode character set is allowed and 
supported. This means at least getting rid of the reference to UCS-2 (clause 2) 
and the bizarre equivalence between characters and UTF-16 code units (clause 
6). ECMAScript has already defined several ways to create UTF-16 strings 
containing supplementary characters (parsing UTF-8 source; using Unicode 
escapes for surrogate pairs), and lets applications freely pass around such 
strings. Browsers have surrounded ECMAScript implementations with text input, 
text rendering, DOM APIs, and XMLHTTPRequest with full Unicode support, and 
generally use full UTF-16 to exchange text with their ECMAScript subsystem. 
Developers have used this to build applications that support supplementary 
characters, hacking around the remaining gaps in ECMAScript as seen above. But, 
as in the bug report that Brendan pointed to this morning 
(http://code.google.com/p/v8/issues/detail?id=761), the mention of UCS-2 is 
still used by some to excuse bugs.
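
For the search problem in point 1, here is a minimal sketch using the "u" (Unicode) regular expression flag that ES2015 later standardized. It was not available when this message was written, and the helper names are illustrative only. The user's text is escaped so it matches literally, then compiled with "i" and "u" so that case folding and matching work per code point.

```javascript
// Escape regex syntax characters so user input is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Case-insensitive literal search. The "u" flag makes the engine treat
// each supplementary character as one atom instead of two code units.
function makeSearchPattern(userInput) {
  return new RegExp(escapeRegExp(userInput), "iu");
}

const lower = "\u{10436}\u{10432}\u{1044C}"; // Deseret small letters "𐐶𐐲𐑌"
const upper = "\u{1040E}\u{1040A}\u{10424}"; // the corresponding capitals

// With "u", the capitalized text is found by the lowercase query.
const found = makeSearchPattern(lower).test("x" + upper + "y");

// Without "u", the surrogate halves are compared unit by unit and the
// low surrogates of the capital and small letters differ.
const foundWithoutU = new RegExp(escapeRegExp(lower), "i").test(upper);

console.log(found, foundWithoutU);
```

The second pattern shows the ES5 behavior the message complains about: the surrogate code units have no case mappings of their own, so the match fails.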


Only after these essentials come the niceties of String representation and 
Unicode escapes:

4) 1 String element to 1 Unicode code point is indeed a very nice and desirable 
relationship. Unlike Java, where binary compatibility between virtual machines 
made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be 
compatible only at the source code level - or maybe, with a BRS, not even that.

5) If we don't go for UTF-32, then there should be a few functions to simplify 
access to strings in terms of code points, such as String.fromCodePoint, 
String.prototype.codePointAt.

6) I strongly prefer the use of plain characters over Unicode escapes in source 
code, because plain text is much easier to read than sequences of hex values. 
However, the need for Unicode escapes is greater in the space of supplementary 
characters because here we often have to reference characters for which our 
operating systems don't have glyphs yet. And \u{1D11E} certainly makes it 
easier to cross-reference a character than \uD834\uDD1E. The new escape syntax 
therefore should be on the list, at low priority.
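
Points 5 and 6 can be illustrated with the functions and escape syntax as they later appeared in ES2015. This is a small sketch rather than anything proposed verbatim in this message: both escape spellings yield the same string, and the code point helpers hide the surrogate pair.

```javascript
const clef = String.fromCodePoint(0x1d11e); // U+1D11E MUSICAL SYMBOL G CLEF

console.log(clef === "\u{1D11E}");             // true: new-style escape
console.log(clef === "\uD834\uDD1E");          // true: surrogate pair escape
console.log(clef.length);                      // 2: length counts code units
console.log(clef.codePointAt(0).toString(16)); // "1d11e"
console.log(Array.from(clef).length);          // 1: iteration is by code point
```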


I think it would help if other people involved in this discussion also 
clarified what exactly their requirements are for "full Unicode support".

Norbert



On Feb 19, 2012, at 0:33 , Brendan Eich wrote:

> Once more unto the breach, dear friends!
> 
> ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had 
> pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say ;-).
> 
> Clearly that was a while ago. These days, we would like full 21-bit Unicode 
> character support in JS. Some (mranney at Voxer) contend that it is a 
> requirement.
> 
> Full 21-bit Unicode support means all of:
> 
> * indexing by characters, not uint16 storage units;
> * counting length as one greater than the last index; and
> * su

Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich
Thanks, all! That's a relief to know, six bytes always seemed too long 
but my reptile coder brain was also reptile-coder-lazy and I never dug 
into it.


/be

Phillips, Addison wrote:

>> Hi Mark, thanks for this post.
>>
>> Mark Davis ☕ wrote:
>>> UTF-8 represents a code point as 1-4 8-bit code units
>>
>> "1-6".
>
> No. 1 to *4*. Five and six byte "UTF-8" sequences are illegal and invalid.
>
>>> UTF-16 represents a code point as 2 or 4 16-bit code units
>>
>> "1 or 2".
>
> Yes, 1 or 2 16-bit code units (that's 2 or 4 bytes, of course).
>
> Addison
>
> Addison Phillips
> Globalization Architect (Lab126)
> Chair (W3C I18N WG)
>
> Internationalization is not a feature.
> It is an architecture.





___
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss


RE: New full Unicode for ES6 idea

2012-02-21 Thread Phillips, Addison
> 
> Hi Mark, thanks for this post.
> 
> Mark Davis ☕ wrote:
> > UTF-8 represents a code point as 1-4 8-bit code units
> 
> "1-6".

No. 1 to *4*. Five and six byte "UTF-8" sequences are illegal and invalid. 

> 
> > UTF-16 represents a code point  as 2 or 4 16-bit code units
> 
> "1 or 2".

Yes, 1 or 2 16-bit code units (that's 2 or 4 bytes, of course).

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.






Re: New full Unicode for ES6 idea

2012-02-21 Thread Tab Atkins Jr.
On Tue, Feb 21, 2012 at 3:11 PM, Brendan Eich  wrote:
> Hi Mark, thanks for this post.
> Mark Davis ☕ wrote:
>>
>> UTF-8 represents a code point as 1-4 8-bit code units
>
> "1-6".
...
> Lock up your encoders, I am so not a Unicode guru but this is what my
> reptile coder brain remembers.

Only theoretically.  UTF-8 has been locked down to the same range that
UTF-16 has (RFC 3629), so the largest real character you'll see is 4
bytes, as that gives you exactly 21 bits of data.

~TJ
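
The unit counts discussed in this sub-thread can be checked with a short sketch. It assumes a runtime that provides the standard TextEncoder (browsers and current Node.js); the helper name is illustrative.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Returns how many UTF-8 bytes and UTF-16 code units one code point needs.
function unitCounts(codePoint) {
  const text = String.fromCodePoint(codePoint);
  return {
    utf8Bytes: encoder.encode(text).length,
    utf16Units: text.length, // JS strings expose UTF-16 code units
  };
}

for (const cp of [0x41, 0xe9, 0x20ac, 0x1f638, 0x10ffff]) {
  const { utf8Bytes, utf16Units } = unitCounts(cp);
  console.log("U+" + cp.toString(16).toUpperCase(), utf8Bytes, utf16Units);
}
// U+41 1 1
// U+E9 2 1
// U+20AC 3 1
// U+1F638 4 2
// U+10FFFF 4 2
```

Even the highest code point, U+10FFFF, fits in four UTF-8 bytes, which matches the 1 to 4 range given above.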


Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich

Hi Mark, thanks for this post.

Mark Davis ☕ wrote:

> UTF-8 represents a code point as 1-4 8-bit code units

"1-6".

> UTF-16 represents a code point as 2 or 4 16-bit code units

"1 or 2".

Lock up your encoders, I am so not a Unicode guru but this is what my 
reptile coder brain remembers.


/be


Re: New full Unicode for ES6 idea

2012-02-21 Thread Allen Wirfs-Brock

On Feb 21, 2012, at 7:37 AM, Brendan Eich wrote:

> Brendan Eich wrote:
>> in open-source browsers and JS engines that use uint16 vectors internally
> 
> Sorry, that reads badly. All I meant is that I can't tell what closed-source 
> engines do, not that they do not comply with ECMA-262 combined with other web 
> standards to have the same observable effect, e.g. Allen's example:

A quick scan of http://code.google.com/p/v8/issues/detail?id=761 suggests that 
there may be more variability among current browsers than we thought.  I 
haven't tried my original test case in Chrome or IE9 but the discussion in this 
bug report suggests that their behavior may currently be different from FF.

> 
> var c = "😸" // where the single character between the quotes is the Unicode 
> character U+1f638
> 
> c.length == 2;
> c === "\ud83d\ude38"; //the two character UTF-16 encoding of 0x1f638
> c.charCodeAt(0) == 0xd83d;
> c.charCodeAt(1) == 0xde38;
> 
> Still no BRS to set, we need one if we want a full-Unicode outcome (c.length 
> == 1, etc.).
> 
> /be


Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich

Phillips, Addison wrote:

> Because it has always been possible, it’s difficult to say how many 
> scripts have transported byte-oriented data by “punning” the data into 
> strings. Actually, I think this is more likely to be truly binary data 
> rather than text in some non-Unicode character encoding, but anything 
> is possible, I suppose. This could include using non-character values 
> like “FFFE”, “” in addition to the surrogates. A BRS-running 
> implementation would break a script that relied on String being a 
> sequence of 16-bit unsigned integer values with no error checking.




Allen's view of the BRS-enabled semantics would have 16-bit "GIGO" 
without exceptions -- you'd be storing 16-bit values, whatever their 
source (including "\u" literals spelling invalid characters and 
unmatched surrogates) in at-least-21-bit elements of strings, and 
reading them back.


My concern and reason for advocating early or late errors on shenanigans 
was that people today writing surrogate pairs literally and then taking 
extra pains in JS or C++ (whatever the host language might be) to 
process them as single code points and characters would be broken by the 
BRS-enabled behavior of separating the parts into distinct code points.


But that's pessimistic. It could happen, but OTOH anyone coding 
surrogate pairs might want them to read back piece-wise when indexing. 
In that case what Allen proposes, storing each formerly 16-bit code 
unit, however expressed, in the wider 21-or-more-bits unit, and reading 
back likewise, would "just work".


Sorry if this is all obvious. Mainly I want to throw in my lot with 
Allen's exception-free literal/constructor approach. The encoding APIs 
should throw on invalid Unicode but literals and strings as immutable 
16-bit storage buffers should work as today.
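
A small sketch of that split as it can be observed in JavaScript engines today. It is offered as an illustration and is not part of the proposal text: a string literal stores an unmatched surrogate without complaint, while an encoding-oriented API such as encodeURIComponent rejects it.

```javascript
// "Garbage in, garbage out": the literal is accepted and reads back as-is.
const lone = "a\ud800b"; // unmatched high surrogate between two letters
console.log(lone.length);                     // 3
console.log(lone.charCodeAt(1).toString(16)); // "d800"

// An encoding API has to produce valid UTF-8, so it throws instead.
let encodeError = null;
try {
  encodeURIComponent(lone);
} catch (e) {
  encodeError = e;
}
console.log(encodeError instanceof URIError); // true
```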


/be


RE: New full Unicode for ES6 idea

2012-02-21 Thread Phillips, Addison
Because it has always been possible, it’s difficult to say how many scripts 
have transported byte-oriented data by “punning” the data into strings. 
Actually, I think this is more likely to be truly binary data rather than text 
in some non-Unicode character encoding, but anything is possible, I suppose. 
This could include using non-character values like “FFFE”, “” in addition 
to the surrogates. A BRS-running implementation would break a script that 
relied on String being a sequence of 16-bit unsigned integer values with no 
error checking.


> One of my examples, GB 18030, is a four-byte encoding and a Chinese government 
> standard.  It is a mapping onto Unicode, but this mapping is table-driven 
> rather than algorithm driven like the UTF-* transport formats.  To provide a 
> single example, Unicode 0x2259 maps onto GB 18030 0x8136D830.

AP> GB 18030 is more complex than that. Not all characters are four-byte, for 
example. As a multibyte encoding, you might choose to “pun” GB 18030 into a 
String as 81 36 d8 30. There isn’t much attraction to punning it into 0x8136 
0xd830, but, as noted above, someone might be foolish enough to try it ;-). 
Scripts that rely on this probably break under BRS.

> You're right about Big5 being byte-oriented, maybe this was a bad example, 
> although it is a double-byte charset. It works by putting ASCII down low making 
> bytes above 0x7f escapes into code pages dereferenced by the next byte.  Each 
> code point is encoded with one or two bytes, never more.  If I were developing 
> with Big5 in JS, I would store the byte stream 4a 4b d8 00 c1 c2 4c as  004a 
> 004b d800 c1c2 004c.  This would allow me to use JS regular expressions and so 
> on.

Not exactly. The trailing bytes in Big5 start at 0x40, for example. But it is 
certainly the case that some multibyte characters in Big5 happen to have the 
same byte-pair as a surrogate code point (when considered as a pair of bytes) 
or other non-character in the Unicode BMP, and one might (he says, squinting 
really hard) want to do as you suggest and record the multibyte sequence as a 
single code point.

> But the data does not need to arrive from C API -- it could easily be delivered 
> by an XHR request where, say, the remote end dumps database rows into a 
> transport format based around evaluating JS string literals (like JSON).

Allowing isolated invalid sequences isn’t actually the problem, if you think 
about it. Yes, the data is bad and yes you can’t view it cleanly. But you can 
do whatever you need to on it.

The problem is when you intend to store two values that end up as a single 
character. If I have a string with code points “f235 5e7a e040 d800”, the d800 
does no particular harm. The problem is: if I construct a BRS string using that 
sequence and then concatenate the sequence “dc00 a053 3254” onto it, the 
resulting string is only *six* characters long, rather than the expected seven, 
since presumably the d800 dc00 pair turns into U+10000.
Addison
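
The concatenation concern can be approximated in present-day JavaScript by comparing code unit counts with code point iteration. This is only an analogy for the hypothetical BRS string model, in which, as the message says, the merge would "presumably" occur.

```javascript
const head = "\uf235\u5e7a\ue040\ud800"; // ends with a lone high surrogate
const tail = "\udc00\ua053\u3254";       // starts with a lone low surrogate
const joined = head + tail;

// Counted as 16-bit units, the lengths simply add up.
const codeUnits = joined.length;

// Counted as code points, the adjacent d800 dc00 halves merge into one.
const codePoints = Array.from(joined).length;
const pieces = Array.from(head).length + Array.from(tail).length;

console.log(codeUnits, codePoints, pieces);       // 7 6 7
console.log(joined.codePointAt(3).toString(16));  // "10000"
```

The two pieces contain seven code points between them, yet the joined string contains six.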


RE: New full Unicode for ES6 idea

2012-02-21 Thread Phillips, Addison
> 
> I meant ECMA-262 punts source normalization upstream in the spec pipeline
> that runs parallel to the browser's loading-the-URL | processing-what-was-
> loaded pipeline. ECMA-262 is concerned only with its little slice of 
> processing
> heaven.

Yep. One of the problems is that the source script may not be using a Unicode 
encoding or may be using a Unicode encoding and be serialized in a 
non-normalized form. Your slice of processing heaven treats 
Unicode-normalization-equivalent-yet-different-codepoint-sequence tokens as 
unequal. Not that this is a bad thing.
> 
> > By contrast, providing a method for normalizing strings would be useful.
> /summon Norbert.

(hides the breakables, listens for thunder)

Addison



Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich

Phillips, Addison wrote:

>> Normalization happens to source upstream of the JS engine. Here I'll call on a
>> designated Unicode hitter. ;-)
>
> I agree that Unicode Normalization shouldn't happen automagically in the JS
> engine. I rather doubt that "normalization happens to source upstream of the
> JS engine", unless by "upstream" you mean "best see to the normalization
> yourself".

Yes ;-).

I meant ECMA-262 punts source normalization upstream in the spec 
pipeline that runs parallel to the browser's loading-the-URL | 
processing-what-was-loaded pipeline. ECMA-262 is concerned only with its 
little slice of processing heaven.

> By contrast, providing a method for normalizing strings would be useful.

/summon Norbert.

/be


RE: New full Unicode for ES6 idea

2012-02-21 Thread Phillips, Addison
> 
> Normalization happens to source upstream of the JS engine. Here I'll call on a
> designated Unicode hitter. ;-)
> 

I agree that Unicode Normalization shouldn't happen automagically in the JS 
engine. I rather doubt that "normalization happens to source upstream of the JS 
engine", unless by "upstream" you mean "best see to the normalization yourself".

By contrast, providing a method for normalizing strings would be useful.

Addison
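
To make the point concrete, here is a brief sketch using String.prototype.normalize, the method that ES2015 eventually added and which did not exist at the time of this message. Two canonically equivalent spellings of "é" compare unequal until both are normalized to the same form.

```javascript
const composed = "\u00e9";    // é as a single code point
const decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT

// The engine compares code units and does not normalize on its own.
console.log(composed === decomposed); // false

// Explicit normalization to a common form makes them equal.
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true
console.log(decomposed.normalize("NFC").length); // 1
console.log(composed.normalize("NFD").length);   // 2
```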


Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich

Brendan Eich wrote:

> in open-source browsers and JS engines that use uint16 vectors internally


Sorry, that reads badly. All I meant is that I can't tell what 
closed-source engines do, not that they do not comply with ECMA-262 
combined with other web standards to have the same observable effect, 
e.g. Allen's example:


var c = "😸" // where the single character between the quotes is the 
             // Unicode character U+1f638


c.length == 2;
c === "\ud83d\ude38"; //the two character UTF-16 encoding of 0x1f638
c.charCodeAt(0) == 0xd83d;
c.charCodeAt(1) == 0xde38;

Still no BRS to set, we need one if we want a full-Unicode outcome 
(c.length == 1, etc.).
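
The code unit values in the example follow from the standard UTF-16 surrogate arithmetic, sketched below. The helper name is illustrative; for U+1F638 the calculation yields D83D and DE38.

```javascript
// Splits a supplementary code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;        // 20 significant bits remain
  const high = 0xd800 + (offset >> 10);      // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);     // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f638);
const c = String.fromCharCode(high, low);

console.log(high.toString(16), low.toString(16)); // "d83d" "de38"
console.log(c.length);                            // 2
console.log(c === "\ud83d\ude38");                // true
```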


/be


Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich

Wes Garland wrote:

> On 21 February 2012 00:03, Brendan Eich wrote:
>
>> Ball one. :-P
>
> If I hit the batter, does he get to first base?

Walk, yes (http://en.wikipedia.org/wiki/Hit_by_pitch).

> We still haven't talked about equality and normalization, I suppose 
> that can wait.


Allen's point in this last bit of the thread is that we don't need to 
interfere with bits stuffed into code units today, so we shouldn't 
tomorrow when units become as wide as (or wider than) code points. GIGO, 
and equality is memcmp (if you mean == and ===).


Normalization happens to source upstream of the JS engine. Here I'll 
call on a designated Unicode hitter. ;-)


/be


Re: New full Unicode for ES6 idea

2012-02-21 Thread Brendan Eich

Andrew Oakley wrote:

> On 02/20/12 16:47, Brendan Eich wrote:
>> Andrew Oakley wrote:
>>> Issues only arise in code that tries to treat a string as an array of
>>> 16-bit integers, and I don't think we should be particularly bothered by
>>> performance of code which misuses strings in this fashion (but clearly
>>> this should still work without opt-in to new string handling).
>>
>> This is all strings in JS and the DOM, today.
>>
>> That is, we do not have any measure of code that treats strings as
>> uint16s, forges strings using "\u", etc. but the ES and DOM specs
>> have allowed this for > 14 years. Based on bitter experience, it's
>> likely that if we change by fiat to 21-bit code points from 16-bit code
>> units, some code on the Web will break.
>
> Sorry, I don't think I was particularly clear.  The point I was trying
> to make is that we can *pretend* that code points are 16-bit but
> actually use a 21-bit representation internally.

So far, that's like Allen's proposal from last year 
(http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings). 
But you didn't say how iteration (indexing and .length) work.

> If content requests
> proper Unicode support we simply switch to allowing 21-bit code-points
> and stop encoding characters outside the BMP using surrogate pairs
> (because the characters now fit in a single code point).


How does content request proper Unicode support? Whatever that gesture 
is, it's big and red ;-). But we don't have such a switch or button to 
press like that, yet.


If a .js or .html file as fetched from a server has a UTF-8 encoding, 
indeed non-BMP characters in string literals will be transcoded in 
open-source browsers and JS engines that use uint16 vectors internally, 
but each part of the surrogate pair will take up one element in the 
uint16 vector. Let's take this now as a "content request" to use full 
Unicode. But the .js file was developed 8 years ago and assumes two code 
units, not one. It hardcodes for that assumption, somehow (indexing, 
.length exact value, indexOf('\ud800'), etc.). It is now broken.
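
For illustration, here is the kind of legacy code this paragraph has in mind. It is a hypothetical helper, not taken from any real script, that hard-codes the assumption that a non-BMP character occupies two uint16 elements.

```javascript
// Hypothetical pre-ES6 helper: reads the code point starting at `index`,
// assuming supplementary characters are stored as surrogate pairs.
function legacyCodePointAt(s, index) {
  var high = s.charCodeAt(index);
  if (high >= 0xd800 && high <= 0xdbff && index + 1 < s.length) {
    var low = s.charCodeAt(index + 1);
    if (low >= 0xdc00 && low <= 0xdfff) {
      // Combine the two halves by hand.
      return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
    }
  }
  return high;
}

console.log(legacyCodePointAt("\ud83d\ude38", 0).toString(16)); // "1f638"
console.log(legacyCodePointAt("A", 0).toString(16));            // "41"
```

Callers of such a helper typically advance the index by two after a surrogate pair. If the same source text were instead stored as a single element, that stride and any exact .length checks would be wrong, which is the breakage described above.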


And non-literal non-BMP characters won't be helped by transcoding 
differently when the .js or .html file is fetched. They'll just change 
"size" at runtime.


/be



Re: Sugar for *.prototype and for calling methods as functions

2012-02-21 Thread Axel Rauschmayer
> There is a proposal for making available existing functions via modules in 
> ES6:
> 
> http://wiki.ecmascript.org/doku.php?id=harmony:modules_standard
> 
> If there are methods missing from this list that can reasonably be
> used as stand-alone functions, then I'm sure nobody will object to
> adding them.


Beautiful, no more constructors-as-poor-man’s-namespaces.

All generic methods could be in such modules, with uncurried `this`. But, with 
generic methods I’m undecided, they also make sense as “static” methods:
http://wiki.ecmascript.org/doku.php?id=strawman:array_statics

-- 
Dr. Axel Rauschmayer
a...@rauschma.de

home: rauschma.de
twitter: twitter.com/rauschma
blog: 2ality.com



Re: New full Unicode for ES6 idea

2012-02-21 Thread Wes Garland
On 21 February 2012 00:03, Brendan Eich wrote:

> These are byte-based encodings, no? What is the problem inflating them by
> zero extension to 16 bits now (or 21 bits in the future)? You can't make an
> invalid Unicode character from a byte value.
>

One of my examples, GB 18030, is a four-byte encoding and a Chinese
government standard.  It is a mapping onto Unicode, but this mapping is
table-driven rather than algorithm driven like the UTF-* transport
formats.  To provide a single example, Unicode 0x2259 maps onto GB 18030
0x8136D830.

You're right about Big5 being byte-oriented; maybe this was a bad example,
although it is a double-byte charset. It works by putting ASCII down low and
making bytes above 0x7f escapes into code pages dereferenced by the next
byte.  Each code point is encoded with one or two bytes, never more.  If I
were developing with Big5 in JS, I would store the byte stream 4a 4b d8 00
c1 c2 4c as 004a 004b d800 c1c2 004c.  This would allow me to use JS
regular expressions and so on.
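
The packing described above can be sketched as follows. The helper name `punBig5` is hypothetical, and the sketch assumes that every byte above 0x7f is a lead byte followed by exactly one trail byte; it does not validate real Big5 ranges.

```javascript
// Hypothetical helper: pack a Big5 byte stream into 16-bit string elements.
function punBig5(bytes) {
  var units = [];
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b > 0x7f && i + 1 < bytes.length) {
      // Lead byte and trail byte share one 16-bit element.
      units.push((b << 8) | bytes[++i]);
    } else {
      // ASCII range (or a truncated trailing lead byte), zero-extended.
      units.push(b);
    }
  }
  return String.fromCharCode.apply(null, units);
}

var punned = punBig5([0x4a, 0x4b, 0xd8, 0x00, 0xc1, 0xc2, 0x4c]);

// Ordinary regular expressions work on the punned string:
// two ASCII letters, two double-byte characters, one ASCII letter.
var matches = /^JK[\s\S]{2}L$/.test(punned);
```

The third element is 0xd800, which is an unpaired surrogate if the string is read as UTF-16. That is why this use case bears on how invalid characters are treated.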

> Anyway, Big5 punned into JS strings (via a C or C++ API?) is *not* a strong
> use-case for ignoring invalid characters.
>

Agreed - I'm stretching to see if I can stretch far enough to find a real
problem with BRS -- because I really want it.

But the data does not need to arrive from C API -- it could easily be
delivered by an XHR request where, say, the remote end dumps database rows
into a transport format based around evaluating JS string literals (like
JSON).
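
A minimal sketch of that delivery path, assuming a hypothetical response body in which the remote end has escaped each 16-bit element as a `\uXXXX` string-literal escape. The payload and property name are made up for illustration.

```javascript
// Hypothetical XHR response text: one database row serialized as JSON.
// \ud800 is an unpaired surrogate; JSON.parse accepts the escape as-is.
var responseText = '{"row": "JK\\ud800\\uc1c2L"}';
var row = JSON.parse(responseText).row;

// Collect the 16-bit elements as hex for inspection.
var units = [];
for (var i = 0; i < row.length; i++) {
  units.push(row.charCodeAt(i).toString(16));
}
// units: ["4a", "4b", "d800", "c1c2", "4c"]
```

No C or C++ API is involved here: the unpaired surrogate arrives purely through string-literal evaluation.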

> Ball one. :-P
>

If I hit the batter, does he get to first base?

We still haven't talked about equality and normalization; I suppose that
can wait.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102


Re: New full Unicode for ES6 idea

2012-02-21 Thread Andrew Oakley
On 02/20/12 16:47, Brendan Eich wrote:
> Andrew Oakley wrote:
>> Issues only arise in code that tries to treat a string as an array of
>> 16-bit integers, and I don't think we should be particularly bothered by
>> performance of code which misuses strings in this fashion (but clearly
>> this should still work without opt-in to new string handling).
> 
> This is all strings in JS and the DOM, today.
> 
> That is, we do not have any measure of code that treats strings as
> uint16s, forges strings using "\u", etc. but the ES and DOM specs
> have allowed this for > 14 years. Based on bitter experience, it's
> likely that if we change by fiat to 21-bit code points from 16-bit code
> units, some code on the Web will break.

Sorry, I don't think I was particularly clear.  The point I was trying
to make is that we can *pretend* that code points are 16-bit but
actually use a 21-bit representation internally.  If content requests
proper Unicode support we simply switch to allowing 21-bit code-points
and stop encoding characters outside the BMP using surrogate pairs
(because the characters now fit in a single code point).
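
One way to picture this is the sketch below. It is only an illustration under stated assumptions: the internal representation is modelled as a plain array of code points, and the function names `decode`, `legacyLength`, and `legacyCharCodeAt` are hypothetical, not part of any engine.

```javascript
// Hypothetical internal representation: an array of code points (up to 21 bits).
function decode(str) {
  var cps = [];
  for (var i = 0; i < str.length; i++) {
    var hi = str.charCodeAt(i);
    var lo = str.charCodeAt(i + 1); // NaN past the end, so the range test fails
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      cps.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
      i++; // consumed both halves of the pair
    } else {
      cps.push(hi); // BMP code point or lone surrogate, kept as-is
    }
  }
  return cps;
}

// Legacy view: pretend the elements are 16-bit code units.
function legacyLength(cps) {
  var n = 0;
  for (var i = 0; i < cps.length; i++) n += cps[i] > 0xFFFF ? 2 : 1;
  return n;
}

function legacyCharCodeAt(cps, index) {
  for (var i = 0; i < cps.length; i++) {
    var cp = cps[i];
    if (cp > 0xFFFF) {
      if (index === 0) return 0xD800 + ((cp - 0x10000) >> 10);
      if (index === 1) return 0xDC00 + ((cp - 0x10000) & 0x3FF);
      index -= 2;
    } else {
      if (index === 0) return cp;
      index -= 1;
    }
  }
  return NaN; // out of range, like String.prototype.charCodeAt
}

var internal = decode("a\uD835\uDC9Cb"); // [0x61, 0x1D49C, 0x62]
```

In this naive form, indexed access through the legacy view costs a linear scan. That is the kind of performance cost mentioned above for code that treats a string as an array of 16-bit integers. With full Unicode support enabled, the code would use `internal.length` and `internal[i]` directly.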

> And as noted in the o.p. and in the thread based on Allen's proposal
> last year, browser implementations definitely count on representation
> via array of 16-bit integers, with length property or method counting same.
> 
> Breaking the Web is off the table. Breaking implementations, less so.
> I'm not sure why you bring up UTF-8. It's good for encoding and decoding
> but for JS, unlike C, we want string to be a high level "full Unicode"
> abstraction. Not bytes with bits optionally set indicating more bytes
> follow to spell code points.

Yes, I probably shouldn't have brought up UTF-8 (we do store strings
using UTF-8; I was thinking about our own implementation). The intention
was not to "break the web". My comments about issues when strings were
misused were purely *performance* concerns; behaviour would otherwise
remain unchanged (unless full Unicode support had been enabled).

-- 
Andrew Oakley


Re: Sugar for *.prototype and for calling methods as functions

2012-02-21 Thread Andreas Rossberg
On 21 February 2012 13:59, Brandon Benvie wrote:
> I would ask as an exploratory idea: is there any interest in, and what
> problems exist with, exposing most {Builtin}.prototype.* methods as unbound
> functional {Builtin}.* functions? Or failing that, a more succinct expression
> for the following:
>
> Function.prototype.[call/apply].bind({function}).
> Array.prototype.[map/reduce/forEach].call(arraylike, callback)
> Object.set('key', va)

There is a proposal for making available existing functions via modules in ES6:

http://wiki.ecmascript.org/doku.php?id=harmony:modules_standard

If there are methods missing from this list that can reasonably be
used as stand-alone functions, then I'm sure nobody will object to
adding them.

/Andreas


Re: Sugar for *.prototype and for calling methods as functions

2012-02-21 Thread Brandon Benvie
I would ask as an exploratory idea: is there any interest in, and what
problems exist with, exposing most {Builtin}.prototype.* methods as unbound
functional {Builtin}.* functions? Or failing that, a more succinct
expression for the following:

Function.prototype.[call/apply].bind({function}).
Array.prototype.[map/reduce/forEach].call(arraylike, callback)
Object.set('key', va)

Basically, JavaScript has incredible usage potential as a functional
language but has almost no built-in support in terms of applyable
functions. It teases you with its charms and then gives no direct payout,
instead asking you to put just one more dollar in for the good stuff.


Re: Sugar for *.prototype and for calling methods as functions

2012-02-21 Thread Axel Rauschmayer
The established way of doing this is [].forEach, "".trim, {}.valueOf. I imagine 
that by now, there would be no performance penalty, any more, because most 
engines are aware of this (ab)use. But it is indeed not very 
intention-revealing. It might make sense to wait with this proposal until 
classes are finished, but they probably won’t introduce any changes at this 
level.
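
A minimal sketch of that established idiom, using a made-up array-like object. The literal forms look up the same function objects as the spelled-out `Array.prototype` and `String.prototype` forms, which is why they work.

```javascript
// An array-like object that is not an Array (illustrative data).
var arrayLike = { 0: "  a ", 1: " b", length: 2 };

var seen = [];
[].forEach.call(arrayLike, function (s) {
  seen.push("".trim.call(s));
});
// seen: ["a", "b"]

// The literal forms resolve to the same functions as the long forms.
var same = [].forEach === Array.prototype.forEach &&
           "".trim === String.prototype.trim &&
           ({}).valueOf === Object.prototype.valueOf;
```

Note that `{}.valueOf` needs parentheses at the start of a statement, since a leading `{` is parsed as a block.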

On Feb 21, 2012, at 12:16 , Mariusz Nowak wrote:

> 
> Sorry if it already has been picked up (I searched and didn't find anything
> close to that).
> 
> In my last months of work with JavaScript, what I miss a lot in ES5
> syntax is:
> 
> 1. Syntax shortcut for '.prototype'. Instead of writing
> String.prototype.trim I'd love to be able to write for example String#trim 
> (it's not a proposal, just an example of how it may look).
> As most native ES methods are generic there are a lot of valid use cases for
> that, e.g.:
> 
> Array#forEach.call(listThatsNotArray, fn);
> 
> 2. Syntax sugar for calling a method as a function. In the following examples I
> just place '@' at the end of the method that I'd like to be run as a function.
> 
> Array#forEach@(listThatsNotArray, fn);
> 
> or 
> 
> trimmedListOfStrings = listOfStrings.map(String#trim@);
> 
> The last example is the same as the following in ES5:
> 
> trimmedListOfStrings =
> listOfStrings.map(Function.prototype.call.bind(String.prototype.trim));
> 
> These two proposals will make methods easily accessible for some functional
> constructs, and I think might be revolutionary for those who favor such a
> functional way of programming.
> 
> Let me know what you think about that.

-- 
Dr. Axel Rauschmayer
a...@rauschma.de

home: rauschma.de
twitter: twitter.com/rauschma
blog: 2ality.com



Re: Sugar for *.prototype and for calling methods as functions

2012-02-21 Thread Brandon Benvie
Error in the example; it should be: `string.split("\n").map(String.trim)`


Re: Sugar for *.prototype and for calling methods as functions

2012-02-21 Thread Brandon Benvie
This request is the very definition of little things that go a long way. I
write a hell of a lot of code that boils down
to Function.prototype.bind(Function.prototype.call/apply,
Somebuiltin.prototype.method). The fact that there's no builtin way to
accomplish `string.split("\n").map(String.split)` (split just as an
example) is annoying in how obvious it is that it should work, and how
often I need them. In fact I think there's some modification to
Spidermonkey that has this? Array.* and String.* being functional versions
of the prototype methods.
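
For illustration, functional versions of that shape can be built in plain ES5. This is only a sketch, not SpiderMonkey's implementation: `makeStatics`, `ArrayStatics`, and `StringStatics` are hypothetical names, and the built-in constructors are deliberately left unmodified.

```javascript
// Hypothetical helper: build a namespace of functional versions of methods.
function makeStatics(proto, names) {
  var statics = {};
  names.forEach(function (name) {
    // call.bind(method) yields a function taking `this` as its first argument.
    statics[name] = Function.prototype.call.bind(proto[name]);
  });
  return statics;
}

var ArrayStatics = makeStatics(Array.prototype, ["map", "reduce", "forEach"]);
var StringStatics = makeStatics(String.prototype, ["trim", "split"]);

var lines = "  a \n b ".split("\n").map(StringStatics.trim);
var lengths = ArrayStatics.map({ 0: "xy", 1: "z", length: 2 }, function (s) {
  return s.length;
});
```

One caveat of passing such functions straight to `map`: the index and array arguments that `map` supplies are forwarded to the method. That is harmless for `trim` but matters for methods that take parameters, such as `split`.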


Sugar for *.prototype and for calling methods as functions

2012-02-21 Thread Mariusz Nowak

Sorry if it already has been picked up (I searched and didn't find anything
close to that).

In my last months of work with JavaScript, what I miss a lot in ES5
syntax is:

1. Syntax shortcut for '.prototype'. Instead of writing
String.prototype.trim I'd love to be able to write for example String#trim 
(it's not a proposal, just an example of how it may look).
As most native ES methods are generic there are a lot of valid use cases for
that, e.g.:

Array#forEach.call(listThatsNotArray, fn);

2. Syntax sugar for calling a method as a function. In the following examples I
just place '@' at the end of the method that I'd like to be run as a function.

Array#forEach@(listThatsNotArray, fn);

or 

trimmedListOfStrings = listOfStrings.map(String#trim@);

The last example is the same as the following in ES5:

trimmedListOfStrings =
listOfStrings.map(Function.prototype.call.bind(String.prototype.trim));
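
The ES5 expression above can be wrapped in a small reusable helper. The name `uncurryThis` is an assumption made for this sketch; the technique is the same `Function.prototype.call.bind(...)` shown in the message.

```javascript
// uncurryThis(f) is equivalent to Function.prototype.call.bind(f).
var uncurryThis = Function.prototype.bind.bind(Function.prototype.call);

var trim = uncurryThis(String.prototype.trim);
var forEach = uncurryThis(Array.prototype.forEach);

// Roughly what `listOfStrings.map(String#trim@)` is meant to express.
var listOfStrings = ["  a", "b  ", " c "];
var trimmedListOfStrings = listOfStrings.map(trim);

// Roughly what `Array#forEach@(listThatsNotArray, fn)` is meant to express.
var listThatsNotArray = { 0: "x", 1: "y", length: 2 };
var collected = [];
forEach(listThatsNotArray, function (v) {
  collected.push(v);
});
```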

These two proposals will make methods easily accessible for some functional
constructs, and I think might be revolutionary for those who favor such a
functional way of programming.

Let me know what you think about that.

-- 
Mariusz Nowak
https://github.com/medikoo
http://twitter.com/medikoo
