> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can be
> expected to support UTF-16, even though it's not strictly required, and
> comments in the V8 headers suggest that strings are indeed UTF-16. However,
> in the real world, it turns out that JavaScript string functions and regular
> expressions are not required to support UTF-16, which can have surrogate
> pairs (multiple 16-bit quantities representing a single character). Because
> of this pre-existing condition, V8 is not in a position to do better, since
> this would break compatibility with other JavaScript engines.

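To make the surrogate-pair issue concrete, here is a minimal sketch of how a character outside the Basic Multilingual Plane looks to the string functions and to a regular expression. The helper name describe is made up for illustration and is not part of V8 or the spec. The results noted in the comments assume an engine that treats strings as sequences of 16-bit units, as discussed in this thread.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane,
// so UTF-16 encodes it as the surrogate pair 0xD834 0xDD1E.
var clef = "\uD834\uDD1E";

// Hypothetical helper: report how the 16-bit-unit string operations
// see a string (its length and each unit in hex).
function describe(s) {
  var units = [];
  for (var i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16));
  }
  return { length: s.length, units: units };
}

var info = describe(clef);
console.log(info.length);                  // 2: counted in 16-bit units, not characters
console.log(info.units.join(" "));         // d834 dd1e
console.log(clef.charAt(0) === "\uD834");  // true: charAt returns half of the pair
console.log(/^.$/.test(clef));             // false: '.' matches a single 16-bit unit
console.log(/^..$/.test(clef));            // true: the pair counts as two units
```

Nothing here is lost or corrupted: the string round-trips intact, but every index-based or per-character operation sees two units where a Unicode-aware reader would see one character.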
In most cases the spec tells us to treat strings as UCS-2, including most string operations such as charAt and case conversion. This is not optional: handling surrogate pairs would actually be incorrect according to the spec. In a few cases (I can only think of 'eval', but there may be more) the spec says to treat strings as UTF-16. Again, this is not optional.

As you say, for compatibility reasons we would be reluctant to switch any of the places where we use UCS-2 to UTF-16. However, for most operations I think the switch could be made without breaking any code on the web. For instance, JavaScriptCore uses UTF-16 for case conversion and it doesn't seem to be an issue.

> So now my question is whether people expect to be able to use/store UTF-16
> in JavaScript even though this cannot be expected to work reliably for
> anything beyond the simplest read/write cases. I'm pondering whether I'd be
> doing my customers (client developers) a favor by using iconv to convert all
> text to UCS-2 before handing it to V8. This would give me an opportunity to
> detect that the input characters cannot be converted to UCS-2 before they
> ever got into V8 and caused subtle problems, possibly much farther down the
> road when it would be difficult to figure them out.

This is an application-specific question, so it's very hard to give a general answer. If your program depends on string operations being correct according to the Unicode standard, for instance on surrogate pairs being converted correctly to upper and lower case, then you're in trouble if your program is written in JavaScript. However, most of the language and even many string operations are unaffected by this, and the operations that are affected still use a consistent and reliable model; it is just not the same as the Unicode model.

v8-users mailing list
http://groups.google.com/group/v8-users
