Note also that you can't generally tell whether a program will behave correctly under UCS-2. For instance, consider this program:
  var dci = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);
  var dli = dci.toLowerCase();
  print(dci == dli);

(dci is a Deseret capital I, represented by a surrogate pair.) Under UCS-2 this program prints true; under UTF-16 it prints false. Programs like this cannot be detected reliably.

On Wed, Oct 22, 2008 at 9:45 AM, Erik Corry <[EMAIL PROTECTED]> wrote:

> It's worth remembering that if you put UTF-16 into a JS string and then get the UTF-16 out again then you will not lose any data. In a sense V8 is transparent to UTF-16. It's only when you manipulate the string in JS in certain ways that you risk 'corruption'. For example, if you use substring to cut a string in the middle of a surrogate pair, then the result will no longer be valid UTF-16.
>
> On Wed, Oct 22, 2008 at 3:08 AM, Pete Gontier <[EMAIL PROTECTED]> wrote:
>
>> On Oct 21, 2008, at 1:45 AM, Christian Plesner Hansen wrote:
>>
>> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can be expected to support UTF-16, even though it's not strictly required, and comments in the V8 headers suggest that strings are indeed UTF-16. However, in the real world, it turns out that JavaScript string functions and regular expressions are not required to support UTF-16, which can have surrogate pairs (multiple 16-bit quantities representing a single character). Because of this pre-existing condition, V8 is not in a position to do better, since this would break compatibility with other JavaScript engines.
>>
>> In most cases the spec tells us to treat strings as UCS-2, including most string operations like charAt and case conversion. This is not optional; handling surrogate pairs would actually be incorrect according to the spec. In a few cases (I can only think of 'eval', but there may be more) the spec says to treat strings as UTF-16. Again, this is not optional.
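Erik's substring point can be demonstrated directly. This is a minimal sketch of the corruption he describes; console.log stands in for whatever print function the host provides:

```javascript
// One Deseret character encoded as a UTF-16 surrogate pair.
var pair = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);

// substring operates on 16-bit code units, so this cuts the pair in half.
var left = pair.substring(0, 1);

// left is now a lone high surrogate -- not valid UTF-16 on its own.
console.log(pair.length);                      // 2
console.log(left.charCodeAt(0).toString(16));  // d801
```

Round-tripping the whole string is lossless, as Erik says; it is only this kind of code-unit-level slicing that produces ill-formed UTF-16.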
>> As you say, for compatibility reasons we would be reluctant to switch any of the places we use UCS-2 to using UTF-16. However, for most operations I think the switch could be made without breaking any code on the web. For instance, JavaScriptCore uses UTF-16 for case conversion and it doesn't seem to be an issue.
>>
>> So now my question is whether people expect to be able to use/store UTF-16 in JavaScript even though this cannot be expected to work reliably for anything beyond the simplest read/write cases. I'm pondering whether I'd be doing my customers (client developers) a favor by using iconv to convert all text to UCS-2 before handing it to V8. This would give me an opportunity to detect that the input characters cannot be converted to UCS-2 before they ever got into V8 and caused subtle problems, possibly much farther down the road when it would be difficult to figure them out.
>>
>> This is an application-specific question; it's very hard to give a general answer. If your program depends on string operations being correct according to the Unicode standard, for instance that surrogate pairs are converted correctly to upper and lower case, then you're in trouble if your program is written in JavaScript. However, most of the language and even many string operations are unaffected by this, and the operations that are affected still use a consistent and reliable model -- it is just not the same as the Unicode model.
>>
>> Thanks for the insight and thanks in advance for tolerating my thinking out loud here.
>>
>> The app in question is an application server in early development. When I say "customers (client developers)", I'm referring to the future. Happily, I'm not concerned about a large body of existing code. As well, I don't think I need to be concerned about militant JavaScript activists demanding UTF-16 in the few cases it's allowed.
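The "consistent but not the Unicode model" behavior Christian describes is easy to observe: length and charAt count 16-bit code units, so a single supplementary-plane character looks like two characters. A minimal sketch (console.log stands in for the host's print function):

```javascript
// A single Deseret capital I, stored as two 16-bit code units.
var s = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);

// UCS-2 view: two code units, even though it is one character.
console.log(s.length);                      // 2

// charAt(0) yields the high surrogate alone, not the full character.
console.log(s.charCodeAt(0).toString(16));  // d801
console.log(s.charCodeAt(1).toString(16));  // dc00
```

Every engine that follows the spec gives these same answers, which is what makes the model reliable even though it disagrees with Unicode's notion of a character.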
>> So, on one hand, I may have an opportunity now to prevent some heartache and head-scratching, and I'm somewhat inclined to be a proactive paranoid gatekeeper and require every string coming in from the outside world to convert with full fidelity to UCS-2, even if there are some cases (such as 'eval') which would tolerate UTF-16.
>>
>> On the other hand, I'm not so crazy as to think I want to implement every bit of this application server myself, and there may well be script libraries written primarily for use within web browsers which I would like to incorporate -- or anyway make it possible to incorporate. I suppose if the only strings they ever see are UCS-2, then they will work just fine, but if they have features which depend on UTF-16, those will break or cause breakage. I bet such features are few and far between, but I can't know conclusively. Hmmm.
>>
>> I suppose one approach would be to use UCS-2 until someone complains. :-)
>>
>> Pete Gontier <http://pete.gontier.org/>
>
> --
> Erik Corry, Software Engineer
> Google Denmark ApS. CVR nr. 28 86 69 84
> c/o Philip & Partners, 7 Vognmagergade, P.O. Box 2227, DK-1018 Copenhagen K, Denmark.

v8-users mailing list
[email protected]
http://groups.google.com/group/v8-users
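Pete's "paranoid gatekeeper" could also be implemented in JavaScript itself rather than via iconv: a string converts to UCS-2 with full fidelity exactly when it contains no surrogate code units. A hypothetical sketch (the function name is made up for illustration):

```javascript
// Returns true if s contains no surrogate code units, i.e. it round-trips
// through UCS-2 with full fidelity. Rejecting strings where this is false
// keeps UTF-16-dependent data out of the system entirely.
function isUCS2Safe(s) {
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDFFF) return false;  // surrogate code unit
  }
  return true;
}

console.log(isUCS2Safe("plain ASCII"));                        // true
console.log(isUCS2Safe(String.fromCharCode(0xD801, 0xDC00)));  // false
```

Note this rejects well-formed surrogate pairs as well as lone surrogates, which is exactly the strict behavior Pete is contemplating: nothing outside the Basic Multilingual Plane gets in.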
