[v8-users] Re: some minor string confusion :)

Christian Plesner Hansen Tue, 21 Oct 2008 01:45:20 -0700

> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can be
> expected to support UTF-16, even though it's not strictly required, and
> comments in the V8 headers suggest that strings are indeed UTF-16. However,
> in the real world, it turns out that JavaScript string functions and regular
> expressions are not required to support UTF-16, which can have surrogate
> pairs (multiple 16-bit quantities representing a single character). Because
> of this pre-existing condition, V8 is not in a position to do better, since
> this would break compatibility with other JavaScript engines.


In most cases, including most string operations like charAt and case
conversion, the spec tells us to treat strings as UCS-2.  This is not
optional: handling surrogate pairs would actually be incorrect
according to the spec.  In a few cases (I can only think of 'eval', but
there may be more) the spec says to treat strings as UTF-16.  Again,
this is not optional.
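
To make the UCS-2 view concrete, here is a minimal sketch of what the
code-unit model means for length, charAt and charCodeAt.  The sample
character (U+1D11E) and the helper name codeUnits are illustrative
choices, not anything taken from the spec or from V8.

```javascript
// U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane,
// so UTF-16 encodes it as a surrogate pair: 0xD834 followed by 0xDD1E.
const clef = "\uD834\uDD1E";

// Hypothetical helper: list the 16-bit code units of a string as hex.
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16));
  }
  return units;
}

console.log(clef.length);                  // 2: two code units, one character
console.log(codeUnits(clef).join(","));    // d834,dd1e
console.log(clef.charAt(0) === "\uD834");  // true: charAt returns half a pair
```

Each of these operations indexes 16-bit units, so a single character
outside the BMP shows up as two "characters" to the program.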

As you say, for compatibility reasons we would be reluctant to switch
any of the places where we use UCS-2 over to UTF-16.  However, for most
operations I think the switch could be made without breaking any code
on the web.  For instance, JavaScriptCore uses UTF-16 for case
conversion and it doesn't seem to be an issue.

> So now my question is whether people expect to be able to use/store UTF-16
> in JavaScript even though this cannot be expected to work reliably for
> anything beyond the simplest read/write cases. I'm pondering whether I'd be
> doing my customers (client developers) a favor by using iconv to convert all
> text to UCS-2 before handing it to V8. This would give me an opportunity to
> detect that the input characters cannot be converted to UCS-2 before they
> ever got into V8 and caused subtle problems, possibly much farther down the
> road when it would be difficult to figure them out.

This is an application-specific question, so it's very hard to give a
general answer.  If your program depends on string operations being
correct according to the Unicode standard, for instance that surrogate
pairs are converted correctly to upper and lower case, then you're in
trouble if your program is written in JavaScript.  However, most of
the language and even many string operations are unaffected by this,
and the operations that are affected still use a consistent and
reliable model -- it is just not the same as the Unicode model.
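
If you do decide to screen input up front, the check itself is simple
because the code-unit model is consistent.  The sketch below does it in
JavaScript rather than with iconv, purely for illustration; the
function name findFirstSurrogate is hypothetical, and whether to
reject, replace or merely log such input remains an application
decision.

```javascript
// Return the index of the first surrogate code unit (0xD800..0xDFFF) in s,
// or -1 if every code unit is an ordinary BMP character.
function findFirstSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      return i;
    }
  }
  return -1;
}

console.log(findFirstSurrogate("plain text"));      // -1
console.log(findFirstSurrogate("a\uD834\uDD1Eb"));  // 1
```

A string for which this returns -1 contains no surrogate code units,
so the UCS-2 and UTF-16 readings of it agree.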

