On Oct 21, 2008, at 1:45 AM, Christian Plesner Hansen wrote:

>> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript
>> can be expected to support UTF-16, even though it's not strictly
>> required, and comments in the V8 headers suggest that strings are
>> indeed UTF-16. However, in the real world, it turns out that
>> JavaScript string functions and regular expressions are not
>> required to support UTF-16, which can have surrogate pairs
>> (multiple 16-bit quantities representing a single character).
>> Because of this pre-existing condition, V8 is not in a position to
>> do better, since this would break compatibility with other
>> JavaScript engines.
>
> In most cases the spec tells us to treat strings as UCS-2, including
> most string operations like charAt and case conversion. This is not
> optional, handling surrogate pairs would actually be incorrect
> according to the spec. In a few cases (I can only think of 'eval'
> but there may be more) the spec says to treat strings as UTF-16.
> Again, this is not optional.
>
> As you say, for compatibility reasons we would be reluctant to
> switch any of the places we use UCS-2 to using UTF-16. However, for
> most operations I think the switch could be made without breaking
> any code on the web. For instance, JavaScriptCore uses UTF-16 for
> case conversion and it doesn't seem to be an issue.
>
>> So now my question is whether people expect to be able to use/store
>> UTF-16 in JavaScript even though this cannot be expected to work
>> reliably for anything beyond the simplest read/write cases. I'm
>> pondering whether I'd be doing my customers (client developers) a
>> favor by using iconv to convert all text to UCS-2 before handing it
>> to V8. This would give me an opportunity to detect that the input
>> characters cannot be converted to UCS-2 before they ever got into
>> V8 and caused subtle problems, possibly much farther down the road
>> when it would be difficult to figure them out.
>
> This is an application specific question, it's very hard to give a
> general answer. If your program depends on string operations being
> correct according to the unicode standard, for instance that
> surrogate pairs are converted correctly to upper and lower case,
> then you're in trouble if your program is written in JavaScript.
> However, most of the language and even many string operations are
> unaffected by this, and the operations that are affected still use a
> consistent and reliable model -- it is just not the same as the
> unicode model.

Thanks for the insight and thanks in advance for tolerating my thinking out loud here.

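For concreteness, here is a small sketch of the "consistent but not Unicode" model described in the quoted text: string operations that work on 16-bit code units see a surrogate pair as two separate characters. The sample character (U+10400) and the variable names are only illustrative choices, not anything taken from the spec or from V8.

```javascript
// U+10400 lies outside the Basic Multilingual Plane, so UTF-16 encodes it
// as the surrogate pair D801 DC00. It is written here as two escapes.
var s = "\uD801\uDC00";

var report = {
  // length counts 16-bit code units, not Unicode characters
  length: s.length,
  // charCodeAt returns each half of the pair on its own
  firstUnit: s.charCodeAt(0).toString(16),
  secondUnit: s.charCodeAt(1).toString(16),
  // a plain regular expression "." matches one code unit at a time
  dotMatches: s.match(/./g).length
};

console.log(report);
```

One character by the Unicode model is reported as two by every operation above, which is the kind of mismatch that could surface far from where the string first entered the engine.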
The app in question is an application server in early development. When I say "customers (client developers)", I'm referring to the future. Happily, I'm not concerned about a large body of existing code. As well, I don't think I need to be concerned about militant JavaScript activists demanding UTF-16 in the few cases it's allowed.

So, on one hand, I may have an opportunity now to prevent some heartache and head-scratching, and I'm somewhat inclined to be a proactive paranoid gatekeeper and require every string coming in from the outside world to convert with full fidelity to UCS-2, even if there are some cases (such as 'eval') which would tolerate UTF-16.

On the other hand, I'm not so crazy as to think I want to implement every bit of this application server myself, and there may well be script libraries written primarily for use within web browsers which I would like to incorporate -- or anyway make it possible to incorporate. I suppose if the only strings they ever see are UCS-2, then they will work just fine, but if they have features which depend on UTF-16, those will break or cause breakage. I bet such features are few and far between, but I can't know conclusively.

Hmmm. I suppose one approach would be to use UCS-2 until someone complains. :-)

Pete Gontier
<http://pete.gontier.org/>
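
P.S. A minimal sketch of the gatekeeper idea, assuming the check is done in script on a string that has already reached the engine, rather than with iconv on the way in as described above. The function names `firstSurrogateIndex` and `requireUcs2` are hypothetical, and the rule used (any code unit in the surrogate range D800 to DFFF means the text does not fit in UCS-2) is one possible reading of "convert with full fidelity".

```javascript
// Hypothetical helper: index of the first surrogate code unit, or -1 if none.
function firstSurrogateIndex(text) {
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    // D800..DFFF is reserved for UTF-16 surrogates and has no UCS-2 meaning.
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      return i;
    }
  }
  return -1;
}

// Hypothetical gatekeeper: return the text unchanged if it is pure UCS-2,
// otherwise fail early with the position of the offending code unit.
function requireUcs2(text) {
  var at = firstSurrogateIndex(text);
  if (at !== -1) {
    throw new RangeError("not representable in UCS-2: surrogate at index " + at);
  }
  return text;
}
```

Calling `requireUcs2` at every point where outside text enters the server would reject such input up front, at the cost of refusing any script library data that legitimately relies on surrogate pairs.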
