Re: Full Unicode strings strawman

2011-05-19 Thread Brendan Eich
On May 18, 2011, at 3:46 PM, Waldemar Horwat wrote: On 05/16/11 11:11, Allen Wirfs-Brock wrote: I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason. Feedback would be appreciated:

RE: Full Unicode strings strawman

2011-05-19 Thread Shawn Steele
- Java kept their strings encoded exactly as they were (a sequence of 16-bit code units) and provided extra APIs for the cases where you want to extract a code point. Bloaty. ? defining UTF-16 instead of UCS-2 introduces zero bloat. In fact, it pretty much works anyway, it's just not

Re: Full Unicode strings strawman

2011-05-19 Thread Brendan Eich
On May 19, 2011, at 10:27 AM, Shawn Steele wrote: The crucial win of Allen's proposal comes down the road, when someone in a certain locale *can* do s.indexOf(nonBMPChar) and win. s.indexOf(\U+1), Ok, but \U+... does not work today. who cares that it ends up as UTF-16? You can

Re: Full Unicode strings strawman

2011-05-19 Thread Mark S. Miller
On Thu, May 19, 2011 at 9:50 AM, Brendan Eich bren...@mozilla.com wrote: [...] That seems worth considering, rather than s.wideIndexOf(nonBMPChar). Not jumping into the Unicode debate yet. But I did want to nip this terminological possibility in the bud. PLEASE do not refer to non-BMP

RE: Full Unicode strings strawman

2011-05-19 Thread Shawn Steele
The crucial win of Allen's proposal comes down the road, when someone in a certain locale *can* do s.indexOf(nonBMPChar) and win. s.indexOf(\U+1), Ok, but \U+... does not work today. Yes, that would be worth adding (IMO) as a convenience, regardless of whether the backend were UTF-16

Re: Re: Full Unicode strings strawman

2011-05-19 Thread Douglas Crockford
On 11:59 AM, Brendan Eich wrote: But hey, if JS does not need to change then we can avoid trouble and keep on using 16-bit indexing and length. Is this really the best outcome? It may well be. The problem is largely theoretical, and the many offered cures seem to be much worse than the disease.

Re: Full Unicode strings strawman

2011-05-19 Thread Brendan Eich
On May 19, 2011, at 11:18 AM, Douglas Crockford wrote: A more critical need is some form of string.format or quasiliterals. Yes, these are important. On the agenda for next week? Which strawmen? I've had trouble sorting through the quasi-variations on the wiki, and I know I'm not alone. /be

Re: Full Unicode strings strawman

2011-05-19 Thread Mark S. Miller
On Thu, May 19, 2011 at 12:05 PM, Brendan Eich bren...@mozilla.com wrote: On May 19, 2011, at 11:18 AM, Douglas Crockford wrote: A more critical need is some form of string.format or quasiliterals. Yes, these are important. On the agenda for next week? Which strawmen? I've had trouble

Re: Full Unicode strings strawman

2011-05-19 Thread Allen Wirfs-Brock
On May 18, 2011, at 3:46 PM, Waldemar Horwat wrote: 2. Widening characters to 21 bits doesn't really help much. As stated earlier in this thread, you still want to treat clumps of combining characters together with the character to which they combine, worry about various normalized

RE: Full Unicode strings strawman

2011-05-19 Thread Shawn Steele
will be security problems. -Shawn -Original Message- From: es-discuss-boun...@mozilla.org [mailto:es-discuss-boun...@mozilla.org] On Behalf Of Allen Wirfs-Brock Sent: Thursday, May 19, 2011 12:19 PM To: Waldemar Horwat Cc: es-discuss@mozilla.org Subject: Re: Full Unicode strings strawman On May 18

Re: Full Unicode strings strawman

2011-05-19 Thread Allen Wirfs-Brock
On May 19, 2011, at 2:06 PM, Shawn Steele wrote: There are several sequences in Unicode which are meaningless if you have only one character and not the other. Eg: any of the variation selectors by themselves are meaningless. So if you break a modified character from its variation

RE: Full Unicode strings strawman

2011-05-19 Thread Shawn Steele
, May 19, 2011 3:00 PM To: Shawn Steele Cc: Waldemar Horwat; es-discuss@mozilla.org; Peter Constable Subject: Re: Full Unicode strings strawman On May 19, 2011, at 2:06 PM, Shawn Steele wrote: There are several sequences in Unicode which are meaningless if you have only one character

Re: Full Unicode strings strawman

2011-05-19 Thread Allen Wirfs-Brock
On May 19, 2011, at 3:35 PM, Shawn Steele wrote: I'm still not at all convinced :) I don't buy that the linguistic case isn't interesting Just to be clear, I'm not saying the linguistic case isn't interesting. It's obviously very interesting for a lot of applications. I was trying to say

Re: Full Unicode strings strawman

2011-05-18 Thread 신정식, 申政湜
On Tue, May 17, 2011 at 11:09 AM, Shawn Steele shawn.ste...@microsoft.com wrote: I would much prefer changing UCS-2 to UTF-16, thus formalizing that surrogate pairs are permitted. That'd be very difficult to break any existing code and would still allow representation of everything reasonable

Re: Full Unicode strings strawman

2011-05-18 Thread Mark Davis ☕
On Tue, May 17, 2011 at 20:01, Wes Garland w...@page.ca wrote: Mark; Are you Dr. *Mark E. Davis* (born September 13, 1952 (age 58)), co-founder of the Unicode http://en.wikipedia.org/wiki/Unicode project and the president of the Unicode

RE: Full Unicode strings strawman

2011-05-18 Thread Shawn Steele
Hmm... I proposed break iterators for 'character/grapheme', word, line and sentence as a part of i18n API, but it's shot down (at least for version 0.5). Are you open to adding them now? Once this discussion is settled and the proposal to support the full Unicode range is in place, we can

Re: Full Unicode strings strawman

2011-05-18 Thread Mark Davis ☕
Yes, one of the options for the internal storage of the string class is to use different arrays depending on the contents. 1. uint8's if all the code points are <= FF 2. uint16's if all the code point values are <= FFFF 3. uint32's otherwise That way the internal storage always corresponds
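A minimal sketch of the storage idea in the snippet above, assuming the thresholds are 0xFF and 0xFFFF. The function name `packCodePoints` is hypothetical and not from the thread; it only illustrates choosing the narrowest array that can hold every code point of a string.

```javascript
// Hypothetical helper (not from the thread): choose the narrowest typed
// array that can hold every code point of the given string.
function packCodePoints(str) {
  // Iterating a string yields whole code points; a lone surrogate comes
  // through as a single element.
  const codePoints = Array.from(str, (ch) => ch.codePointAt(0));
  const max = codePoints.reduce((m, cp) => Math.max(m, cp), 0);
  if (max <= 0xff) return Uint8Array.from(codePoints);    // option 1
  if (max <= 0xffff) return Uint16Array.from(codePoints); // option 2
  return Uint32Array.from(codePoints);                    // option 3
}
```

With this layout the array index of an element is also its code point index, at the cost of re-packing when a wider character is appended.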

Re: Full Unicode strings strawman

2011-05-18 Thread Waldemar Horwat
On 05/16/11 11:11, Allen Wirfs-Brock wrote: I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason. Feedback would be appreciated: http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings Allen

Re: Full Unicode strings strawman

2011-05-17 Thread Norbert Lindenberg
I have read the discussion so far, but would like to come back to the strawman itself because I believe that it starts with a problem statement that's incorrect and misleading the discussion. Correctly describing the current situation would help in the discussion of possible changes, in

Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 16 May 2011 17:42, Boris Zbarsky bzbar...@mit.edu wrote: On 5/16/11 4:38 PM, Wes Garland wrote: Two great things about strings composed of Unicode code points: ... Even though this is a breaking change from ES-5, I support it whole-heartedly but I expect breakage to be very limited.

Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky
On 5/17/11 10:40 AM, Wes Garland wrote: On 16 May 2011 17:42, Boris Zbarsky bzbar...@mit.edu Those aren't code points at all. They're just not Unicode. Not quite: code points D800-DFFF are reserved code points which are not representable with UTF-16. Nor with any other Unicode encoding,

Re: Full Unicode strings strawman

2011-05-17 Thread Brendan Eich
On May 16, 2011, at 8:13 PM, Allen Wirfs-Brock wrote: I think it does. In another reply I also mentioned the possibility of tagging in a JS visible manner strings that have gone through a known encoding process. Saw that, seems helpful. Want to spec it? If the strings you are combining

Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky
On 5/17/11 1:05 PM, Brendan Eich wrote: If the strings you are combining from different sources have not been canonicalized to a common encoding then you better be damn careful how you combine them. Programmers miss this as you note, so arguably things are not much worse, at best no worse,

Re: Full Unicode strings strawman

2011-05-17 Thread Brendan Eich
On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote: Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that in the long run JS in the browser is going to have to be able to deal with arbitrary

Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky
On 5/17/11 1:27 PM, Brendan Eich wrote: On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote: Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's claim that in the long run JS in the browser is going to

Re: Full Unicode strings strawman

2011-05-17 Thread Brendan Eich
On May 17, 2011, at 10:37 AM, Boris Zbarsky wrote: On 5/17/11 1:27 PM, Brendan Eich wrote: On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote: Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding issues. I don't agree with Allen's

Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky
On 5/17/11 1:40 PM, Brendan Eich wrote: On May 17, 2011, at 10:37 AM, Boris Zbarsky wrote: On 5/17/11 1:27 PM, Brendan Eich wrote: On May 17, 2011, at 10:22 AM, Boris Zbarsky wrote: Yes. And right now that's how it works and actual JS authors typically don't have to worry about encoding

Re: Full Unicode strings strawman

2011-05-17 Thread Brendan Eich
On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote: On 5/17/11 1:40 PM, Brendan Eich wrote: Where do you read forcing? Not in the words you cited. In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings,

Re: Full Unicode strings strawman

2011-05-17 Thread Brendan Eich
On May 17, 2011, at 10:47 AM, Brendan Eich wrote: On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote: On 5/17/11 1:40 PM, Brendan Eich wrote: Where do you read forcing? Not in the words you cited. In the substance of having strings in different encodings around at the same time. If that

RE: Full Unicode strings strawman

2011-05-17 Thread Shawn Steele
I would much prefer changing UCS-2 to UTF-16, thus formalizing that surrogate pairs are permitted. That'd be very difficult to break any existing code and would still allow representation of everything reasonable in Unicode. That would enable Unicode, and allow extending string literals and

Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky
On 5/17/11 1:47 PM, Brendan Eich wrote: On May 17, 2011, at 10:43 AM, Boris Zbarsky wrote: On 5/17/11 1:40 PM, Brendan Eich wrote: Where do you read forcing? Not in the words you cited. In the substance of having strings in different encodings around at the same time. If that doesn't

Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 17 May 2011 12:36, Boris Zbarsky bzbar...@mit.edu wrote: Not quite: code points D800-DFFF are reserved code points which are not representable with UTF-16. Nor with any other Unicode encoding, really. They don't represent, on their own, Unicode characters. Right - but they are still

Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky
On 5/17/11 2:12 PM, Wes Garland wrote: That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88. By the same argument, you can encode them in UTF-16. The byte sequence above is not valid UTF-8. See How do I convert an unpaired UTF-16 surrogate to

Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky
On 5/17/11 2:24 PM, Allen Wirfs-Brock wrote: In the substance of having strings in different encodings around at the same time. If that doesn't force developers to worry about encodings, what does, exactly? This already occurs in JS. For example, the encodeURI function produces a string whose
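To illustrate the encodeURI remark in the snippet above: the result is an ordinary JS string whose %XX escapes spell out UTF-8 bytes, so one program can already hold strings in two different encodings. A small sketch; the helper name `utf8Escapes` is hypothetical.

```javascript
// Hypothetical helper: return the percent-encoded (UTF-8 based) form of a string.
function utf8Escapes(str) {
  return encodeURI(str);
}

const latin = utf8Escapes("\u00e9");             // U+00E9 becomes two UTF-8 bytes
const astral = utf8Escapes("\ud800\udc00");      // U+10000 becomes four UTF-8 bytes
const roundTrip = decodeURI(astral);             // back to the surrogate pair
```

Here `latin` is "%C3%A9" and `astral` is "%F0%90%80%80"; both are plain strings of ASCII code units.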

RE: Full Unicode strings strawman

2011-05-17 Thread Phillips, Addison
...@microsoft.com] Sent: Tuesday, May 17, 2011 11:09 AM To: Brendan Eich; Boris Zbarsky Cc: es-discuss Subject: RE: Full Unicode strings strawman I would much prefer changing UCS-2 to UTF-16, thus formalizing that surrogate pairs are permitted. That'd be very difficult to break any existing

RE: Full Unicode strings strawman

2011-05-17 Thread Shawn Steele
Right - but they are still legitimate code points, and they fill out the space required to let us treat String as uint16[] when defining the backing store as something that maps to the set of all Unicode code points. That said, you can encode these code points with utf-8; for example,

Re: Full Unicode strings strawman

2011-05-17 Thread Allen Wirfs-Brock
On May 17, 2011, at 12:00 PM, Phillips, Addison wrote: Note: The W3C Internationalization Core WG published a set of requirements in this area for consideration by ES some time ago. It lives here: http://www.w3.org/International/wiki/JavaScriptInternationalization You might want to

RE: Full Unicode strings strawman

2011-05-17 Thread Phillips, Addison
[mailto:al...@wirfs-brock.com] Sent: Tuesday, May 17, 2011 12:16 PM To: Phillips, Addison Cc: Shawn Steele; Brendan Eich; Boris Zbarsky; es-discuss Subject: Re: Full Unicode strings strawman On May 17, 2011, at 12:00 PM, Phillips, Addison wrote: Note: The W3C Internationalization Core WG

Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 17 May 2011 14:39, Boris Zbarsky bzbar...@mit.edu wrote: On 5/17/11 2:12 PM, Wes Garland wrote: That said, you can encode these code points with utf-8; for example, 0xdc08 becomes 0xed 0xb0 0x88. By the same argument, you can encode them in UTF-16. The byte sequence above is not valid

Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 17 May 2011 15:00, Phillips, Addison addi...@lab126.com wrote: 2. Allowing unpaired surrogates is a *requirement*. Yes, such a string is ill-formed, but there are too many cases in which one might wish to have such broken strings for scripting purposes. 3. We should have escape syntax for

Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky
On 5/17/11 3:29 PM, Wes Garland wrote: But the point remains, the FAQ entry you quote talks about encoding a lone surrogate, i.e. a code unit, which is not a complete code point. You can only convert complete code points from one encoding to another. Just like you can't represent part of a UTF-8

Re: Full Unicode strings strawman

2011-05-17 Thread Mark Davis ☕
The wrong conclusion is being drawn. I can say definitively that for the string a\uD800b. - It is a valid Unicode string, according to the Unicode Standard. - It cannot be encoded as well-formed in any UTF-x (it is not 'well-formed' in any UTF). - When it comes to conversion, the bad
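A short sketch of how today's JS APIs behave for the string a\uD800b discussed above. It assumes a runtime that provides the standard `TextEncoder` global (browsers, recent Node.js); the function names are hypothetical.

```javascript
const loneSurrogate = "a\uD800b"; // a legal JS string of three code units

// encodeURIComponent refuses to produce UTF-8 for a lone surrogate.
function throwsURIError(str) {
  try {
    encodeURIComponent(str);
    return false;
  } catch (e) {
    return e instanceof URIError;
  }
}

// TextEncoder instead substitutes U+FFFD (bytes EF BF BD) for the lone surrogate.
function utf8Bytes(str) {
  return Array.from(new TextEncoder().encode(str));
}
```

Both behaviours are consistent with the claim that the string is usable in memory but has no well-formed UTF-8 form: one API rejects it, the other replaces the bad element on conversion.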

Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 17 May 2011 16:03, Boris Zbarsky bzbar...@mit.edu wrote: On 5/17/11 3:29 PM, Wes Garland wrote: The problem is that UTF-16 cannot represent all possible code points. My point is that neither can UTF-8. Can you name an encoding that _can_ represent the surrogate-range codepoints?

Re: Full Unicode strings strawman

2011-05-17 Thread Boris Zbarsky
On 5/17/11 5:24 PM, Wes Garland wrote: UTF-8 and UTF-32. I think UTF-7 can, too, but it is not a standard so it's not really worth discussing. UTF-16 is the odd one out. That's not what the spec says. Okay, I think we have to agree to disagree here. I believe my reading of the spec is

Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
On 17 May 2011 20:09, Boris Zbarsky bzbar...@mit.edu wrote: On 5/17/11 5:24 PM, Wes Garland wrote: Okay, I think we have to agree to disagree here. I believe my reading of the spec is correct. Sorry, but no... how much more clear can the spec get? In the past, I have read it thus,

Re: Full Unicode strings strawman

2011-05-17 Thread Mark Davis ☕
That is incorrect. See below. Mark *— Il meglio è l’inimico del bene —* On Tue, May 17, 2011 at 18:33, Wes Garland w...@page.ca wrote: On 17 May 2011 20:09, Boris Zbarsky bzbar...@mit.edu wrote: On 5/17/11 5:24 PM, Wes Garland wrote: Okay, I think we have to agree to disagree here. I

Re: Full Unicode strings strawman

2011-05-17 Thread Wes Garland
Mark; Are you Dr. *Mark E. Davis* (born September 13, 1952 (age 58)), co-founder of the Unicode http://en.wikipedia.org/wiki/Unicode project and the president of the Unicode Consortium http://en.wikipedia.org/wiki/Unicode_Consortium since its incorporation in 1991? (If so, uh, thanks for giving me

Re: Full Unicode strings strawman

2011-05-16 Thread Mike Samuel
2011/5/16 Allen Wirfs-Brock al...@wirfs-brock.com: I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason. Feedback would be appreciated: http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings

RE: Full Unicode strings strawman

2011-05-16 Thread Shawn Steele
Thanks for making a strawman Unicode Escape Sequences Is it possible for U+ to accept either 4, 5, or 6 digit sequences? Typically when I encounter U+ notation the leading zero is omitted, and I see BMP characters quite often. Obviously BMP could use the U notation, however it seems like

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 11:30 AM, Mike Samuel wrote: 2011/5/16 Allen Wirfs-Brock al...@wirfs-brock.com: I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason. Feedback would be appreciated:

Re: Full Unicode strings strawman

2011-05-16 Thread Mike Samuel
2011/5/16 Shawn Steele shawn.ste...@microsoft.com: myString.replace( /[\ud800-\udbff](?![\udc00-\u])/g, \ufffd)    .replace( /(^|[^\ud800-\udbff])([\udc00-\ud])/g, \ufffd) My example code has typos. It should have read myString.replace( /[\ud800-\udbff](?![\udc00-\udfff])/g,
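The snippet above is cut off, so here is a sketch of the same idea (replace every orphaned surrogate with U+FFFD) rather than a reconstruction of the original code. It is a variant that assumes regular-expression lookbehind is available (ES2018 and later); the function name is hypothetical.

```javascript
// Hypothetical helper: replace unpaired surrogates with U+FFFD.
// Without the "u" flag the regular expression works on 16-bit code units.
function replaceOrphanSurrogates(str) {
  return str.replace(
    // a high surrogate not followed by a low one,
    // or a low surrogate not preceded by a high one
    /[\ud800-\udbff](?![\udc00-\udfff])|(?<![\ud800-\udbff])[\udc00-\udfff]/g,
    "\ufffd"
  );
}
```

Using a lookbehind instead of a consuming `(^|[^\ud800-\udbff])` group means two adjacent orphaned low surrogates are both replaced, and no capture group has to be copied back into the replacement.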

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 11:34 AM, Shawn Steele wrote: Thanks for making a strawman (see my very last sentence below as it may impact the interpretation of some of the rest of these responses) Unicode Escape Sequences Is it possible for U+ to accept either 4, 5, or 6 digit sequences?

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 12:28 PM, Mike Samuel wrote: DOMString is defined at http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 thus Type Definition DOMString A DOMString is a sequence of 16-bit units. so how would round tripping a JS string through a DOM string work?

Re: Full Unicode strings strawman

2011-05-16 Thread Mike Samuel
2011/5/16 Allen Wirfs-Brock al...@wirfs-brock.com: On May 16, 2011, at 12:28 PM, Mike Samuel wrote: DOMString is defined at http://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-C74D1578 thus    Type Definition DOMString    A DOMString is a sequence of 16-bit units. so how would

Re: Full Unicode strings strawman

2011-05-16 Thread Wes Garland
Allen; Thanks for putting this together. We use Unicode data extensively in both our web and server-side applications, and being forced to deal with UTF-16 surrogate pair directly -- rather than letting the String implementation deal with them -- is a constant source of mild pain. At first

Re: Full Unicode strings strawman

2011-05-16 Thread Boris Zbarsky
On 5/16/11 4:37 PM, Mike Samuel wrote: You might have. If you reject my assertion about option 2 above, then to clarify, The UTF-16 representation of codepoint U+10000 is the code-unit pair U+D8000 U+DC000. No. The UTF-16 representation of codepoint U+10000 is the code-unit pair 0xD800
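For reference, a sketch of the standard UTF-16 arithmetic behind this exchange, which maps a supplementary code point such as U+10000 to its pair of 16-bit code units. The function name `toSurrogatePair` is hypothetical.

```javascript
// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;   // 20 significant bits
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}
```

For U+10000 the offset is zero, so the pair is 0xD800 0xDC00, each a four-digit code unit.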

Re: Full Unicode strings strawman

2011-05-16 Thread Mike Samuel
2011/5/16 Wes Garland w...@page.ca: Mike Samuel, can you explain why you are en/decoding UTF-16 when round-tripping through the DOM? I was UTF-16 encoding it because there will be host objects in browsers that assume a UTF-16 encoding and so a possibility for orphaned surrogates in internal

Re: Full Unicode strings strawman

2011-05-16 Thread Mike Samuel
2011/5/16 Boris Zbarsky bzbar...@mit.edu: On 5/16/11 4:37 PM, Mike Samuel wrote: You might have. If you reject my assertion about option 2 above, then to clarify, The UTF-16 representation of codepoint U+10000 is the code-unit pair U+D8000 U+DC000. No. The UTF-16 representation of

Re: Full Unicode strings strawman

2011-05-16 Thread Mark Davis ☕
I'm quite sympathetic to the goal, but the proposal does represent a significant breaking change. The problem, as Shawn points out, is with indexing. Before, the strings were defined as UTF16. Take a sample string \ud800\udc00\u0061 = \u{10000}\u{61}. Right now, the 'a' (the \u{61}) is at offset
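The indexing difference described above can be checked directly. This sketch compares the index of 'a' counted in 16-bit code units (what `indexOf` reports today) with its index counted in code points; the two helper names are hypothetical.

```javascript
const sample = "\ud800\udc00\u0061"; // one supplementary character followed by 'a'

// Index counted in 16-bit code units, as String.prototype.indexOf reports it.
function codeUnitIndexOf(str, ch) {
  return str.indexOf(ch);
}

// Index counted in code points: iterate the string by code point first.
function codePointIndexOf(str, ch) {
  return Array.from(str).indexOf(ch);
}
```

For `sample`, the code unit index of 'a' is 2 and the code point index is 1, which is the shift in offsets that existing code would observe.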

Re: Full Unicode strings strawman

2011-05-16 Thread Mike Samuel
Allen, could you clarify something. When the strawman says without mentioning codepoint The String type is the set of all finite ordered sequences of zero or more 16-bit\b\b\b\b\b\b 21-bit unsigned integer values (“elements”). does that mean that String.charCodeAt(...) can return any value in

RE: Full Unicode strings strawman

2011-05-16 Thread Shawn Steele
, 2011 12:53 PM To: Shawn Steele Cc: es-discuss@mozilla.org Subject: Re: Full Unicode strings strawman On May 16, 2011, at 11:34 AM, Shawn Steele wrote: Thanks for making a strawman (see my very last sentence below as it may impact the interpretation of some of the rest of these responses

Re: Full Unicode strings strawman

2011-05-16 Thread 신정식, 申政湜
On Mon, May 16, 2011 at 2:19 PM, Mark Davis ☕ m...@macchiato.com wrote: I'm quite sympathetic to the goal, but the proposal does represent a significant breaking change. The problem, as Shawn points out, is with indexing. Before, the strings were defined as UTF16. I agree with what Mark wrote

RE: Full Unicode strings strawman

2011-05-16 Thread Shawn Steele
: Monday, May 16, 2011 2:24 PM To: Mark Davis ☕ Cc: Markus Scherer; es-discuss@mozilla.org Subject: Re: Full Unicode strings strawman On Mon, May 16, 2011 at 2:19 PM, Mark Davis ☕ m...@macchiato.com wrote: I'm quite sympathetic to the goal, but the proposal does represent

Re: Full Unicode strings strawman

2011-05-16 Thread Boris Zbarsky
On 5/16/11 4:38 PM, Wes Garland wrote: Two great things about strings composed of Unicode code points: ... Even though this is a breaking change from ES-5, I support it whole-heartedly but I expect breakage to be very limited. Provided that the implementation does not restrict the storage of

Re: Full Unicode strings strawman

2011-05-16 Thread Mark Davis ☕
-discuss-boun...@mozilla.org] *On Behalf Of *Jungshik Shin (신정식, 申政湜) *Sent:* Monday, May 16, 2011 2:24 PM *To:* Mark Davis ☕ *Cc:* Markus Scherer; es-discuss@mozilla.org *Subject:* Re: Full Unicode strings strawman On Mon, May 16, 2011 at 2:19 PM, Mark Davis ☕ m...@macchiato.com wrote

Re: Full Unicode strings strawman

2011-05-16 Thread Boris Zbarsky
On 5/16/11 5:16 PM, Mike Samuel wrote: The strawman says The String type is the set of all finite ordered sequences of zero or more 21-bit unsigned integer values (“elements”). Yeah, that's not the same thing as an actual Unicode string, and requires handling of all sorts of what if someone

RE: Full Unicode strings strawman

2011-05-16 Thread Shawn Steele
PM To: Shawn Steele Cc: Jungshik Shin (신정식, 申政湜); Markus Scherer; es-discuss@mozilla.org Subject: Re: Full Unicode strings strawman In terms of implementation capabilities, there isn't really a significant practical difference between * a UCS-2 implementation, and * a UTF-16

Re: Full Unicode strings strawman

2011-05-16 Thread Boris Zbarsky
On 5/16/11 5:23 PM, Shawn Steele wrote: I’m having some (ok, a great deal of) confusion between the DOM Encoding and the JavaScript encoding and whatever. I’d assumed that if I had a web page in some encoding, that it was converted to UTF-16 (well, UCS-2), and that’s what the JavaScript engine

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 1:37 PM, Mike Samuel wrote: 2011/5/16 Allen Wirfs-Brock al...@wirfs-brock.com: ... How would var oneSupplemental = \U0001; I don't think I understand your literal notation. \U is a 32-bit character value? In whose implementation? Sorry, please read this

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 1:38 PM, Wes Garland wrote: Allen; Thanks for putting this together. We use Unicode data extensively in both our web and server-side applications, and being forced to deal with UTF-16 surrogate pair directly -- rather than letting the String implementation deal

Re: Full Unicode strings strawman

2011-05-16 Thread Mark Davis ☕
A correction. U+D800 is indeed a code point: http://www.unicode.org/glossary/#Code_Point. It is defined for usage in Unicode Strings (see http://www.unicode.org/glossary/#Unicode_String) because often it is useful for implementations to be able to allow it in processing. It does, however, have a

Re: Full Unicode strings strawman

2011-05-16 Thread Mark Davis ☕
In practice, the supplemental code points don't really cause problems in Unicode strings. Most implementations just treat them as if they were unassigned. The only important issue is that *when* they are converted to UTF-xx for storage or transmission, they need to be handled; typically by

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 2:16 PM, Mike Samuel wrote: 2011/5/16 Boris Zbarsky bzbar...@mit.edu: On 5/16/11 4:37 PM, Mike Samuel wrote: There is no Unicode codepoint U+D800 or U+DC00. See http://www.unicode.org/charts/PDF/UD800.pdf and http://www.unicode.org/charts/PDF/UDC00.pdf which

RE: Full Unicode strings strawman

2011-05-16 Thread Shawn Steele
...@mozilla.org [mailto:es-discuss-boun...@mozilla.org] On Behalf Of Allen Wirfs-Brock Sent: Monday, May 16, 2011 3:18 PM To: Mark Davis ☕ Cc: Markus Scherer; es-discuss@mozilla.org Subject: Re: Full Unicode strings strawman On May 16, 2011, at 2:19 PM, Mark Davis ☕ wrote: I'm quite sympathetic to the goal

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
See the section of the proposal about String.prototype.charCodeAt On May 16, 2011, at 2:20 PM, Mike Samuel wrote: Allen, could you clarify something. When the strawman says without mentioning codepoint The String type is the set of all finite ordered sequences of zero or more

Re: Full Unicode strings strawman

2011-05-16 Thread Mike Samuel
2011/5/16 Allen Wirfs-Brock al...@wirfs-brock.com: I think you have an extra 0 at a couple of places above... Yep. Sorry. The 0x10000 really is supposed to be five digits though. A DOMString is defined by the DOM spec. to consist of 16-bit elements that are to be interpreted as a UTF-16

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 2:42 PM, Boris Zbarsky wrote: On 5/16/11 4:38 PM, Wes Garland wrote: Two great things about strings composed of Unicode code points: ... Even though this is a breaking change from ES-5, I support it whole-heartedly but I expect breakage to be very limited. Provided

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 3:22 PM, Shawn Steele wrote: The problem is that “\UD800\UDC00” === “\U+010000”. And if the internal representation is UTF-32, then they’d have to continue to be the same. And it’s really hard for them to have the same length if one’s 2 code points and the other’s 1

RE: Full Unicode strings strawman

2011-05-16 Thread Shawn Steele
Not in my proposal! \ud800\udc00 === \u+010000 is false in my proposal. That’s exactly my problem. I think the engines (or at least the applications written in JavaScript) are still UTF-16-centric and that they’ll have d800, dc00 === 10000. For example, if they were different, then d800,

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 3:33 PM, Mike Samuel wrote: 2011/5/16 Allen Wirfs-Brock al...@wirfs-brock.com: Really? There is existing code out there that uses particular implementations for strings. Should the cost of migrating existing implementations be taken into account when considering

Re: Full Unicode strings strawman

2011-05-16 Thread Brendan Eich
On May 16, 2011, at 2:07 PM, Boris Zbarsky wrote: That said, defining JS strings and DOMString differently seems like a recipe for serious author confusion (e.g. actually using JS strings as the DOMString binding in ES might be lossy, assigning from JS strings to DOMString might be lossy,

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 4:21 PM, Shawn Steele wrote: Not in my proposal! \ud800\udc00 === \u+010000 is false in my proposal. That’s exactly my problem. I think the engines (or at least the applications written in JavaScript) are still UTF-16-centric and that they’ll have d800, dc00 ===

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 5:06 PM, Brendan Eich wrote: On May 16, 2011, at 2:07 PM, Boris Zbarsky wrote: That said, defining JS strings and DOMString differently seems like a recipe for serious author confusion (e.g. actually using JS strings as the DOMString binding in ES might be lossy,

RE: Full Unicode strings strawman

2011-05-16 Thread Shawn Steele
I think you'll find that the actual JS engines are currently UCS-2 centric. The surrounding browser environments are doing the UTF-16 interpretation. That's why you see instead of �� in browser generated display output. There’s no difference. I wouldn’t call Windows C++ WCHAR “UCS-2”, however

Re: Full Unicode strings strawman

2011-05-16 Thread Boris Zbarsky
On 5/16/11 6:18 PM, Allen Wirfs-Brock wrote: If the string is written as \ud800\udc00\u0061 the 'a' will be at offset 2, even in the new proposal. It would only be at offset 1 if it was written as \u+010000\u+000061 (using the literal notation from the proposal). Ah, so in the proposal strings

Re: Full Unicode strings strawman

2011-05-16 Thread Mike Samuel
2011/5/16 Allen Wirfs-Brock al...@wirfs-brock.com: If the string is written as \ud800\udc00\u0061 the 'a' will be at offset 2, even in the new proposal. It would only be at offset 1 if it was written as \u+010000\u+000061 (using the literal notation from the proposal). Under this scheme,

Re: Full Unicode strings strawman

2011-05-16 Thread Boris Zbarsky
On 5/16/11 7:21 PM, Shawn Steele wrote: In other words I don’t think you can get the engine to be completely UTF-32. At least not without declaring a page as being UTF-32. For what it's worth, HTML5 does not support declaring a page as UTF-32 at all. We're removing our existing support for

Re: Full Unicode strings strawman

2011-05-16 Thread Brendan Eich
On May 16, 2011, at 5:18 PM, Allen Wirfs-Brock wrote: On May 16, 2011, at 5:06 PM, Brendan Eich wrote: On May 16, 2011, at 2:07 PM, Boris Zbarsky wrote: That said, defining JS strings and DOMString differently seems like a recipe for serious author confusion (e.g. actually using JS

Re: Full Unicode strings strawman

2011-05-16 Thread Boris Zbarsky
On 5/16/11 10:20 PM, Allen Wirfs-Brock wrote: That seems like it'll make it very easy to introduce strings that are a mix of the two via concatenation Some implementations already use tree structures to represent strings that are built via concatenation. It would be straightforward to

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
It already isn't the case that eval(x)===JSON.parse(x). See http://timelessrepo.com/json-isnt-a-javascript-subset On May 16, 2011, at 6:51 PM, Mike Samuel wrote: 2011/5/16 Allen Wirfs-Brock al...@wirfs-brock.com: If the string is written as \ud800\udc00\u0061 the 'a' will be at offset 2,

Re: Full Unicode strings strawman

2011-05-16 Thread Mike Samuel
2011/5/16 Allen Wirfs-Brock al...@wirfs-brock.com: It already isn't the case that eval(x)===JSON.parse(x). See http://timelessrepo.com/json-isnt-a-javascript-subset I'm aware of that hole. That doesn't mean that we should break the relationship for code that doesn't error out in either.

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 7:22 PM, Boris Zbarsky wrote: On 5/16/11 10:20 PM, Allen Wirfs-Brock wrote: That seems like it'll make it very easy to introduce strings that are a mix of the two via concatenation Some implementations already use tree structures to represent strings that are

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 7:53 PM, Mike Samuel wrote: 2011/5/16 Allen Wirfs-Brock al...@wirfs-brock.com: It already isn't the case that eval(x)===JSON.parse(x). See http://timelessrepo.com/json-isnt-a-javascript-subset I'm aware of that hole. That doesn't mean that we should break the

Re: Full Unicode strings strawman

2011-05-16 Thread Allen Wirfs-Brock
On May 16, 2011, at 7:18 PM, Brendan Eich wrote: On May 16, 2011, at 5:18 PM, Allen Wirfs-Brock wrote: On May 16, 2011, at 5:06 PM, Brendan Eich wrote: On May 16, 2011, at 2:07 PM, Boris Zbarsky wrote: That said, defining JS strings and DOMString differently seems like a recipe for