[v8-users] Re: some minor string confusion :)

Erik Corry Wed, 22 Oct 2008 00:45:29 -0700

It's worth remembering that if you put UTF-16 into a JS string and then get
the UTF-16 out again, you will not lose any data.  In that sense V8 is
transparent to UTF-16.  It's only when you manipulate the string in JS in
certain ways that you risk 'corruption'.  For example, if you use substring
to cut a string in the middle of a surrogate pair, the result will no
longer be valid UTF-16.
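
To make that concrete, here is a minimal sketch. The helper names codeUnits
and hasLoneSurrogate are made up for illustration and are not part of V8 or
of the language. It assumes nothing beyond charCodeAt, length and substring.

```javascript
// U+1D11E (MUSICAL SYMBOL G CLEF) is the UTF-16 surrogate pair D834 DD1E.
const clef = "\uD834\uDD1E";
const text = "a" + clef + "b";

// Illustrative helper: return the 16-bit code units of a string.
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// Illustrative helper: true if the string contains a surrogate code unit
// that is not part of a well-formed high/low pair.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (c >= 0xDC00 && c <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// Storing and reading back leaves the code units untouched.
const units = codeUnits(text); // 0x61, 0xD834, 0xDD1E, 0x62

// Cutting after the second code unit splits the pair.
const cut = text.substring(0, 2);
```

Here hasLoneSurrogate(text) is false but hasLoneSurrogate(cut) is true: the
cut string ends in a high surrogate with no partner.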


On Wed, Oct 22, 2008 at 3:08 AM, Pete Gontier <[EMAIL PROTECTED]> wrote:

> On Oct 21, 2008, at 1:45 AM, Christian Plesner Hansen wrote:
>
> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can be
> expected to support UTF-16, even though it's not strictly required, and
> comments in the V8 headers suggest that strings are indeed UTF-16. However,
> in the real world, it turns out that JavaScript string functions and regular
> expressions are not required to support UTF-16, which can have surrogate
> pairs (multiple 16-bit quantities representing a single character). Because
> of this pre-existing condition, V8 is not in a position to do better, since
> this would break compatibility with other JavaScript engines.
>
>
> In most cases the spec tells us to treat strings as UCS-2, including most
> string operations like charAt and case conversion.  This is not optional;
> handling surrogate pairs would actually be incorrect according to the spec.
>  In a few cases (I can only think of 'eval' but there may be more) the spec
> says to treat strings as UTF-16.  Again, this is not optional.
>
>
> As you say, for compatibility reasons we would be reluctant to switch any
> of the places we use UCS-2 to using UTF-16.  However, for most operations I
> think the switch could be made without breaking any code on the web.  For
> instance, JavaScriptCore uses UTF-16 for case conversion and it doesn't seem
> to be an issue.
>
>
> So now my question is whether people expect to be able to use/store UTF-16
> in JavaScript even though this cannot be expected to work reliably for
> anything beyond the simplest read/write cases. I'm pondering whether I'd be
> doing my customers (client developers) a favor by using iconv to convert all
> text to UCS-2 before handing it to V8. This would give me an opportunity to
> detect that the input characters cannot be converted to UCS-2 before they
> ever got into V8 and caused subtle problems, possibly much farther down the
> road when it would be difficult to figure them out.
>
>
> This is an application-specific question; it's very hard to give a general
> answer.  If your program depends on string operations being correct
> according to the Unicode standard, for instance that surrogate pairs are
> converted correctly to upper and lower case, then you're in trouble if your
> program is written in JavaScript.  However, most of the language and even
> many string operations are unaffected by this, and the operations that are
> affected still use a consistent and reliable model -- it is just not the
> same as the Unicode model.
>
>
> Thanks for the insight and thanks in advance for tolerating my thinking out
> loud here.
>
> The app in question is an application server in early development. When I
> say "customers (client developers)", I'm referring to the future. Happily,
> I'm not concerned about a large body of existing code. As well, I don't
> think I need to be concerned about militant JavaScript activists demanding
> UTF-16 in the few cases it's allowed.
>
> So, on one hand, I may have an opportunity now to prevent some heart-ache
> and head-scratching, and I'm somewhat inclined to be a proactive paranoid
> gatekeeper and require every string coming in from the outside world to
> convert with full fidelity to UCS-2, even if there are some cases (such as
> 'eval') which would tolerate UTF-16.
>
> On the other hand, I'm not so crazy as to think I want to implement every
> bit of this application server myself, and there may well be script
> libraries written primarily for use within web browsers which I would like
> to incorporate -- or anyway make it possible to incorporate. I suppose if
> the only strings they ever see are UCS-2, then they will work just fine, but
> if they have features which depend on UTF-16, those will break or cause
> breakage. I bet such features are few and far between, but I can't know
> conclusively. Hmmm.
>
> I suppose one approach would be to use UCS-2 until someone complains. :-)
>
>
> Pete Gontier <http://pete.gontier.org/>

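The gatekeeper idea in the quoted message (reject any incoming string that
does not convert to UCS-2 with full fidelity) could be sketched roughly as
follows. This is only an illustration written in JavaScript; the message
itself talks about doing the check with iconv before the text reaches V8.
The names firstSurrogateIndex and requireUcs2 are hypothetical.

```javascript
// Hypothetical helper: index of the first code unit in the surrogate
// range D800..DFFF, or -1 if the string has none. A string with no
// surrogate code units contains only BMP characters, so it reads the
// same whether it is treated as UCS-2 or as UTF-16.
function firstSurrogateIndex(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDFFF) {
      return i;
    }
  }
  return -1;
}

// Hypothetical gatekeeper: pass the string through unchanged if it is
// UCS-2 clean, otherwise fail loudly at the boundary.
function requireUcs2(s) {
  const i = firstSurrogateIndex(s);
  if (i !== -1) {
    throw new RangeError("surrogate code unit at index " + i);
  }
  return s;
}
```

Failing at the boundary trades some flexibility for earlier diagnosis: a
string containing a surrogate pair is rejected up front rather than
surfacing as a subtle problem further down the road.
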

-- 
Erik Corry, Software Engineer
Google Denmark ApS.  CVR nr. 28 86 69 84
c/o Philip & Partners, 7 Vognmagergade, P.O. Box 2227, DK-1018 Copenhagen K,
Denmark.
