Note also that you can't generally tell whether a program will behave correctly under UCS-2. For instance, consider this program:
  var dci = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);
  var dli = dci.toLowerCase();
  print(dci == dli);

(dci is a Deseret capital I, represented by a surrogate pair.) Under UCS-2 this program prints true; under UTF-16 it prints false. Programs like this cannot be detected reliably.

On Wed, Oct 22, 2008 at 9:45 AM, Erik Corry <[EMAIL PROTECTED]> wrote:

> It's worth remembering that if you put UTF-16 into a JS string and then get the UTF-16 out again then you will not lose any data. In a sense V8 is transparent to UTF-16. It's only when you manipulate the string in JS in certain ways that you risk 'corruption'. For example, if you use substring to cut a string in the middle of a surrogate pair, then the result will no longer be valid UTF-16.
>
> On Wed, Oct 22, 2008 at 3:08 AM, Pete Gontier <[EMAIL PROTECTED]> wrote:
>
>> On Oct 21, 2008, at 1:45 AM, Christian Plesner Hansen wrote:
>>
>> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can be expected to support UTF-16, even though it's not strictly required, and comments in the V8 headers suggest that strings are indeed UTF-16. However, in the real world, it turns out that JavaScript string functions and regular expressions are not required to support UTF-16, which can have surrogate pairs (multiple 16-bit quantities representing a single character). Because of this pre-existing condition, V8 is not in a position to do better, since this would break compatibility with other JavaScript engines.
>>
>> In most cases the spec tells us to treat strings as UCS-2, including most string operations like charAt and case conversion. This is not optional; handling surrogate pairs would actually be incorrect according to the spec. In a few cases (I can only think of 'eval', but there may be more) the spec says to treat strings as UTF-16. Again, this is not optional.
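Erik's substring point can be demonstrated directly. This is a minimal sketch of the corruption he describes; console.log stands in for whatever print function the host provides:

```javascript
// One Deseret character encoded as a UTF-16 surrogate pair.
var pair = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);

// substring operates on 16-bit code units, so this cuts the pair in half.
var left = pair.substring(0, 1);

// left is now a lone high surrogate -- not valid UTF-16 on its own.
console.log(pair.length);                      // 2
console.log(left.charCodeAt(0).toString(16));  // d801
```

Round-tripping the whole string is lossless, as Erik says; it is only this kind of code-unit-level slicing that produces ill-formed UTF-16.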
>> As you say, for compatibility reasons we would be reluctant to switch any of the places we use UCS-2 to using UTF-16. However, for most operations I think the switch could be made without breaking any code on the web. For instance, JavaScriptCore uses UTF-16 for case conversion and it doesn't seem to be an issue.
>>
>> So now my question is whether people expect to be able to use/store UTF-16 in JavaScript even though this cannot be expected to work reliably for anything beyond the simplest read/write cases. I'm pondering whether I'd be doing my customers (client developers) a favor by using iconv to convert all text to UCS-2 before handing it to V8. This would give me an opportunity to detect that the input characters cannot be converted to UCS-2 before they ever got into V8 and caused subtle problems, possibly much farther down the road when it would be difficult to figure them out.
>>
>> This is an application-specific question; it's very hard to give a general answer. If your program depends on string operations being correct according to the Unicode standard, for instance that surrogate pairs are converted correctly to upper and lower case, then you're in trouble if your program is written in JavaScript. However, most of the language and even many string operations are unaffected by this, and the operations that are affected still use a consistent and reliable model -- it is just not the same as the Unicode model.
>>
>> Thanks for the insight and thanks in advance for tolerating my thinking out loud here.
>>
>> The app in question is an application server in early development. When I say "customers (client developers)", I'm referring to the future. Happily, I'm not concerned about a large body of existing code. As well, I don't think I need to be concerned about militant JavaScript activists demanding UTF-16 in the few cases it's allowed.
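The "consistent but not the Unicode model" behavior Christian describes is easy to observe: length and charAt count 16-bit code units, so a single supplementary-plane character looks like two characters. A minimal sketch (console.log stands in for the host's print function):

```javascript
// A single Deseret capital I, stored as two 16-bit code units.
var s = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);

// UCS-2 view: two code units, even though it is one character.
console.log(s.length);                      // 2

// charAt(0) yields the high surrogate alone, not the full character.
console.log(s.charCodeAt(0).toString(16));  // d801
console.log(s.charCodeAt(1).toString(16));  // dc00
```

Every engine that follows the spec gives these same answers, which is what makes the model reliable even though it disagrees with Unicode's notion of a character.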
>> So, on one hand, I may have an opportunity now to prevent some heartache and head-scratching, and I'm somewhat inclined to be a proactive paranoid gatekeeper and require every string coming in from the outside world to convert with full fidelity to UCS-2, even if there are some cases (such as 'eval') which would tolerate UTF-16.
>>
>> On the other hand, I'm not so crazy as to think I want to implement every bit of this application server myself, and there may well be script libraries written primarily for use within web browsers which I would like to incorporate -- or anyway make it possible to incorporate. I suppose if the only strings they ever see are UCS-2, then they will work just fine, but if they have features which depend on UTF-16, those will break or cause breakage. I bet such features are few and far between, but I can't know conclusively. Hmmm.
>>
>> I suppose one approach would be to use UCS-2 until someone complains. :-)
>>
>> Pete Gontier <http://pete.gontier.org/>
>
> --
> Erik Corry, Software Engineer
> Google Denmark ApS. CVR nr. 28 86 69 84
> c/o Philip & Partners, 7 Vognmagergade, P.O. Box 2227, DK-1018 Copenhagen K, Denmark.

v8-users mailing list
[email protected]
http://groups.google.com/group/v8-users
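Pete's "paranoid gatekeeper" could also be implemented in JavaScript itself rather than via iconv: a string converts to UCS-2 with full fidelity exactly when it contains no surrogate code units. A hypothetical sketch (the function name is made up for illustration):

```javascript
// Returns true if s contains no surrogate code units, i.e. it round-trips
// through UCS-2 with full fidelity. Rejecting strings where this is false
// keeps UTF-16-dependent data out of the system entirely.
function isUCS2Safe(s) {
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDFFF) return false;  // surrogate code unit
  }
  return true;
}

console.log(isUCS2Safe("plain ASCII"));                        // true
console.log(isUCS2Safe(String.fromCharCode(0xD801, 0xDC00)));  // false
```

Note this rejects well-formed surrogate pairs as well as lone surrogates, which is exactly the strict behavior Pete is contemplating: nothing outside the Basic Multilingual Plane gets in.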
