It's worth remembering that if you put UTF-16 into a JS string and then get the UTF-16 out again, you will not lose any data. In that sense V8 is transparent to UTF-16. You only risk 'corruption' when you manipulate the string in JS in certain ways. For example, if you use substring to cut a string in the middle of a surrogate pair, the result will no longer be valid UTF-16.
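
To make that concrete, here is a small sketch in plain JavaScript. The helper names codeUnits and hasLoneSurrogate are made up for this illustration and are not part of V8 or of any library. It uses U+1D11E as an example of a character that needs a surrogate pair.

```javascript
// U+1D11E (musical G clef) lies outside the BMP, so UTF-16 encodes it
// as the surrogate pair 0xD834 0xDD1E.
var clef = String.fromCharCode(0xD834, 0xDD1E);
var text = "a" + clef + "b";

// Hypothetical helper: list the 16-bit code units held by a string.
function codeUnits(s) {
  var units = [];
  for (var i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// Hypothetical helper: true if some surrogate in s lacks its partner,
// i.e. s is not well-formed UTF-16.
function hasLoneSurrogate(s) {
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return true;
      i++; // skip the low half of a well-formed pair
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// Round trip: the code units read back are exactly the ones put in.
var units = codeUnits(text);

// Cutting at index 2 keeps "a" and the high surrogate, drops the low one.
var cut = text.substring(0, 2);
```

Here text.length is 4 (the pair counts as two units), hasLoneSurrogate(text) is false, and hasLoneSurrogate(cut) is true. A check along these lines could also be run on the embedder's side if you want to flag strings that went bad somewhere inside script code.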

On Wed, Oct 22, 2008 at 3:08 AM, Pete Gontier <[EMAIL PROTECTED]> wrote:

> On Oct 21, 2008, at 1:45 AM, Christian Plesner Hansen wrote:
>
> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can
> be expected to support UTF-16, even though it's not strictly required,
> and comments in the V8 headers suggest that strings are indeed UTF-16.
> However, in the real world, it turns out that JavaScript string
> functions and regular expressions are not required to support UTF-16,
> which can have surrogate pairs (multiple 16-bit quantities
> representing a single character). Because of this pre-existing
> condition, V8 is not in a position to do better, since this would
> break compatibility with other JavaScript engines.
>
> In most cases the spec tells us to treat strings as UCS-2, including
> most string operations like charAt and case conversion. This is not
> optional, handling surrogate pairs would actually be incorrect
> according to the spec. In a few cases (I can only think of 'eval' but
> there may be more) the spec says to treat strings as UTF-16. Again,
> this is not optional.
>
> As you say, for compatibility reasons we would be reluctant to switch
> any of the places we use UCS-2 to using UTF-16. However, for most
> operations I think the switch could be made without breaking any code
> on the web. For instance, JavaScriptCore uses UTF-16 for case
> conversion and it doesn't seem to be an issue.
>
> So now my question is whether people expect to be able to use/store
> UTF-16 in JavaScript even though this cannot be expected to work
> reliably for anything beyond the simplest read/write cases. I'm
> pondering whether I'd be doing my customers (client developers) a
> favor by using iconv to convert all text to UCS-2 before handing it to
> V8. This would give me an opportunity to detect that the input
> characters cannot be converted to UCS-2 before they ever got into V8
> and caused subtle problems, possibly much farther down the road when
> it would be difficult to figure them out.
>
> This is an application specific question, it's very hard to give a
> general answer. If your program depends on string operations being
> correct according to the unicode standard, for instance that surrogate
> pairs are converted correctly to upper and lower case, then you're in
> trouble if your program is written in JavaScript. However, most of the
> language and even many string operations are unaffected by this, and
> the operations that are affected still use a consistent and reliable
> model -- it is just not the same as the unicode model.
>
> Thanks for the insight and thanks in advance for tolerating my
> thinking out loud here.
>
> The app in question is an application server in early development.
> When I say "customers (client developers)", I'm referring to the
> future. Happily, I'm not concerned about a large body of existing
> code. As well, I don't think I need to be concerned about militant
> JavaScript activists demanding UTF-16 in the few cases it's allowed.
>
> So, on one hand, I may have an opportunity now to prevent some
> heart-ache and head-scratching, and I'm somewhat inclined to be a
> proactive paranoid gatekeeper and require every string coming in from
> the outside world to convert with full fidelity to UCS-2, even if
> there are some cases (such as 'eval') which would tolerate UTF-16.
>
> On the other hand, I'm not so crazy as to think I want to implement
> every bit of this application server myself, and there may well be
> script libraries written primarily for use within web browsers which I
> would like to incorporate -- or anyway make it possible to
> incorporate. I suppose if the only strings they ever see are UCS-2,
> then they will work just fine, but if they have features which depend
> on UTF-16, those will break or cause breakage. I bet such features are
> few and far between, but I can't know conclusively. Hmmm.
>
> I suppose one approach would be to use UCS-2 until someone complains.
> :-)
>
> Pete Gontier <http://pete.gontier.org/>

--
Erik Corry, Software Engineer
Google Denmark ApS. CVR nr. 28 86 69 84
c/o Philip & Partners, 7 Vognmagergade, P.O. Box 2227, DK-1018
Copenhagen K, Denmark.

--~--~---------~--~----~------------~-------~--~----~
v8-users mailing list
[email protected]
http://groups.google.com/group/v8-users
-~----------~----~----~----~------~----~------~--~---
