RE: Full Unicode strings strawman

Shawn Steele Mon, 16 May 2011 14:23:29 -0700

I'm having some (ok, a great deal of) confusion between the DOM Encoding and 
the JavaScript encoding and whatever.  I'd assumed that if I had a web page in 
some encoding, that it was converted to UTF-16 (well, UCS-2), and that's what 
the JavaScript engine did its work on.  I confess to not having done much 
encoding stuff in JS in the last decade.


In UTF-8, individually encoded surrogates are illegal (and a security risk).  
E.g., you shouldn't be able to encode D800/DC00 as two 3 byte sequences; they 
should be a single 4 byte sequence.  Having not played with the JS 
encoding/decoding in quite some time, I'm not sure what they do in that case, 
but hopefully it isn't illegal UTF-8.  (You also shouldn't be able to have half 
a surrogate pair in UTF-16, but many things are pretty lax about that.)
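
A minimal sketch of how the standard encodeURIComponent treats these cases: a well-formed surrogate pair comes out as one 4-byte UTF-8 sequence, and half a pair throws a URIError instead of being encoded as 3 bytes. The helper name tryEncode is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper: returns the percent-encoded form, or the error name
// if the input cannot be encoded as well-formed UTF-8.
function tryEncode(s) {
  try {
    return encodeURIComponent(s);
  } catch (e) {
    return e.name;
  }
}

// U+10000 written as the surrogate pair D800/DC00: one 4-byte sequence.
const pair = tryEncode("\uD800\uDC00");
// Half a surrogate pair: rejected instead of being encoded on its own.
const loneHigh = tryEncode("\uD800");
const loneLow = tryEncode("\uDC00");

console.log(pair, loneHigh, loneLow);
```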

-Shawn

From: Allen Wirfs-Brock [mailto:al...@wirfs-brock.com]
Sent: Monday, May 16, 2011 12:53 PM
To: Shawn Steele
Cc: es-discuss@mozilla.org
Subject: Re: Full Unicode strings strawman


On May 16, 2011, at 11:34 AM, Shawn Steele wrote:


Thanks for making a strawman
(see my very last sentence below as it may impact the interpretation of some of 
the rest of these responses)




Unicode Escape Sequences
Is it possible for U+ to accept either 4, 5, or 6 digit sequences?   Typically 
when I encounter U+ notation the leading zero is omitted, and I see BMP 
characters quite often.  Obviously BMP could use the U notation, however it 
seems like it'd be annoying to the occasional user to know that U is used for 
some and U+ for others.  Seems like it'd be easier for developers to remember 
that U+ is "the new way" and U is "the old way that doesn't always work".

The ES string literal notation doesn't really accommodate variable length 
subtokens without explicit terminators.  What would be the rules for parsing 
"\u+12345678"?  How do we know if the programmer meant "\u1234"+"5678" or 
"\u0012"+"345678" or ...

There have been past proposals for a syntax like \u{xxxxxx} that could have 1 to 
6 hex digits.  In the past proposal the assumption was that it would produce 
UTF-16 surrogate pairs but in this context we could adopt it instead of \u+ to 
produce a single character.  The disadvantage is that it is a slightly long 
sequence for actual large code points.  On the other hand perhaps it is more 
readable?  "\u+123456\u+123456" vs. "\u{123456}\u{123456}" ??
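
To make the terminator argument concrete, here is a rough sketch of a decoder for the braced form. The function name decodeBracedEscapes is hypothetical, and leaving out-of-range values untouched is an assumption of this sketch, not something the proposals specify. It relies on String.fromCodePoint, which in current engines yields a surrogate pair for code points above 0xFFFF rather than the single character the strawman describes.

```javascript
// Hypothetical decoder for the \u{X...} form (1 to 6 hex digits).
// The closing brace is the explicit terminator, so there is only one way
// to split the input into escapes and ordinary characters.
function decodeBracedEscapes(src) {
  return src.replace(/\\u\{([0-9a-fA-F]{1,6})\}/g, (match, hex) => {
    const cp = parseInt(hex, 16);
    // Assumption: values above U+10FFFF are left untouched, not rejected.
    return cp <= 0x10FFFF ? String.fromCodePoint(cp) : match;
  });
}

const a = decodeBracedEscapes("\\u{1234}5678"); // U+1234 followed by "5678"
const b = decodeBracedEscapes("\\u{12}345678"); // U+0012 followed by "345678"
const c = decodeBracedEscapes("\\u{1F600}");    // one supplementary code point
```

The two readings of the ambiguous "\u+12345678" become two visibly different inputs once the braces mark where each escape ends.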




String Position
It's unclear to me if the string indices can be "changed" from UTF-16 to UTF-32 
positions.  Although UTF-32 indices are clearly desirable, I think that many 
implementations currently allow UTF-16 codepoints U+D800 through U+DFFF.  In 
other words, I can already have Javascript strings with full Unicode range data 
in them.  Existing applications would then have indices that pointed to the 
UTF-16, not UTF-32 index.  Changing the definition of the index to UTF-32 would 
break those applications I think.

No, it wouldn't break anything, at least when applied to existing data.  Your 
existing code is explicitly doing UTF-16 processing.  Somebody had to do the 
processing to create the surrogate pairs in the string. As long as you use that 
same agent you are still going to get UTF-16 encoded strings. Even though the 
underlying character values could hold single characters with codepoints > 
\uffff, the actual string won't unless somebody actually constructed the 
string to contain such values.  That presumably doesn't happen for existing 
code.
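
As an illustration of the "existing data" case, the sketch below builds a string from a surrogate pair and processes it with explicit UTF-16 arithmetic. The variable names are illustrative only. Nothing in it depends on how wide a character value could be, which is why such code keeps seeing the same indices as long as its input still arrives as pairs.

```javascript
// U+10000 as it exists in strings today: two 16-bit code units.
const s = "a\uD800\uDC00b";

// Indices count code units, so "b" sits at index 3, not 2.
const length = s.length;
const indexOfB = s.indexOf("b");
const high = s.charCodeAt(1);
const low = s.charCodeAt(2);

// Explicit UTF-16 processing of the kind existing code already performs.
const codePoint = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;

console.log(length, indexOfB, codePoint.toString(16));
```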

The place where existing code might break is if somebody explicitly constructs 
a string (using \u+ literals or String.fromCodepoint) that contains non-BMP 
characters and passes it to routines that only expect 16-bit characters.  
For this reason, any existing host routines that convert external data 
resources to ES strings that contain surrogate pairs should probably continue 
to do so.  New routines should be provided that produce single characters 
instead of pairs for non-BMP code points.  However, the definition of such 
routines is outside the scope of the ES specification.
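
A rough sketch of that failure mode follows. Current engines cannot hold a single character above 0xFFFF, so strings are modeled here as arrays of character values; that modeling, and the names packLegacy and packWithSurrogates, are assumptions made purely for illustration.

```javascript
// Legacy-style routine: assumes every character fits in 16 bits.
function packLegacy(chars) {
  return Uint16Array.from(chars); // values above 0xFFFF wrap modulo 2^16
}

// Routine aware of wide characters: splits them into surrogate pairs first.
function packWithSurrogates(chars) {
  const units = [];
  for (const c of chars) {
    if (c > 0xFFFF) {
      const v = c - 0x10000;
      units.push(0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
    } else {
      units.push(c);
    }
  }
  return Uint16Array.from(units);
}

const wide = [0x61, 0x10000];        // "a" followed by U+10000 as ONE character
const broken = packLegacy(wide);     // the wide character is silently truncated
const ok = packWithSurrogates(wide); // 0x61, 0xD800, 0xDC00
```

The legacy routine does not fail loudly; it just stores the wrong value, which is the reason for keeping pair-producing host routines in place for old callers.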

Finally, note that just as current strings can contain 16-bit character values 
that are not valid Unicode code points, the expanded full Unicode strings can 
also contain 21-bit character values that are not valid Unicode codepoints.


You also touch on that with charCodeAt/codepointAt, which resolves the problem 
with the output type, but doesn't address the problem with the indexing.  
Similar to the way you differentiated charCode/codepoint, it may be necessary 
to differentiate charCode/codepoint indices.  IMO .fromCharCode doesn't have 
this problem since it used to fail, but now works, which wouldn't be breaking.  
Unless we're concerned that now it can return a different UTF-16 length than 
before.

Again, nothing changes.  Code that expects to deal with multi-character 
encodings can still do so.  What "magically" changes is that code that acts 
as if Unicode codepoints are only 16 bits (i.e., the code doesn't correctly deal 
with surrogate pairs) will now work with full 21-bit characters.
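
The two kinds of index can be compared in a current engine, where strings are still sequences of 16-bit units. This sketch uses the codePointAt method that was later standardized, which takes a code unit index and so differs from the strawman's codepointAt; the helper codeUnitIndex is hypothetical.

```javascript
// Hypothetical helper: map a code point index to the matching code unit index.
// Returns -1 when the code point index is past the end of the string.
function codeUnitIndex(str, codePointIndex) {
  let units = 0;
  let points = 0;
  for (const ch of str) {   // string iteration steps by code point
    if (points === codePointIndex) return units;
    units += ch.length;     // 1 for a BMP character, 2 for a surrogate pair
    points += 1;
  }
  return points === codePointIndex ? units : -1;
}

const s = "\uD800\uDC00b";
const unit = s.charCodeAt(0);    // half of the pair
const point = s.codePointAt(0);  // the whole character
const bAt = codeUnitIndex(s, 1); // "b" is code point 1 but code unit 2

console.log(unit.toString(16), point.toString(16), bAt);
```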



I don't like the "21" in the name of decodeURI21.

Suggestions for better names are always welcome.



  Also, the "trick" I think, is encoding to surrogate pairs (illegally, since 
UTF8 doesn't allow that) vs decoding to UTF16.  It seems like decoding can 
safely detect input supplementary characters and properly decode them, or is 
there something about encoding that doesn't make that state detectable?
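
On the decoding side, a strict UTF-8 decoder can tell the two cases apart from the bytes alone. The sketch below assumes a runtime that provides the WHATWG TextDecoder (browsers and Node.js do); it is not part of the ES specification, and the helper name decodeStrict is hypothetical.

```javascript
// Assumption: the runtime provides the WHATWG TextDecoder.
const strict = new TextDecoder("utf-8", { fatal: true });

// Hypothetical helper: decoded string, or null for ill-formed input.
function decodeStrict(bytes) {
  try {
    return strict.decode(Uint8Array.from(bytes));
  } catch (e) {
    return null;
  }
}

// U+10000 as a proper 4-byte sequence decodes to a surrogate pair in UTF-16.
const good = decodeStrict([0xF0, 0x90, 0x80, 0x80]);
// The same character as two 3-byte encoded surrogates is rejected.
const bad = decodeStrict([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80]);

console.log(good === "\uD800\uDC00", bad);
```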

I think I'm missing the distinction you are making between surrogate pairs and 
UTF-16.  I think I've been using the terms interchangeably.  I may be munging 
up the terminology.






-Shawn

From: es-discuss-boun...@mozilla.org<mailto:es-discuss-boun...@mozilla.org> 
[mailto:es-discuss-boun...@mozilla.org] On Behalf Of Allen Wirfs-Brock
Sent: Monday, May 16, 2011 11:12 AM
To: es-discuss@mozilla.org<mailto:es-discuss@mozilla.org>
Subject: Full Unicode strings strawman

I tried to post a pointer to this strawman on this list a few weeks ago, but 
apparently it didn't reach the list for some reason.

Feedback would be appreciated:

http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings

Allen

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
