Too true. That's why I mentioned, though, that I do not have an existing body of code to support. You guys must work within Chrome, and Chrome must work with squillions of existing web pages, so I understand why this would be a big consideration for you. But I suspect/hope that I have an opportunity here to document JavaScript's odd hybrid approach to Unicode encoding and steer people toward UCS-2 unless they really need UTF-16. If they do, they may need to do extra work, or at least be very careful to avoid logic that could cost them a lot of debugging time.
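To make the "be very careful" part concrete, here is a rough sketch of the kind of check I have in mind. The helper names (hasSurrogates, splitsSurrogatePair) are made up for illustration and are not part of V8 or any library; the sketch only assumes that charCodeAt returns individual 16-bit code units, which is the UCS-2-style behavior discussed in this thread.

```javascript
// Hypothetical helper (name invented for this sketch): reports whether a
// string contains any UTF-16 surrogate code unit, i.e. whether it holds
// something that plain UCS-2 cannot represent as a character.
function hasSurrogates(s) {
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDFFF) return true;
  }
  return false;
}

// Hypothetical helper: true if cutting s at index i would separate a high
// surrogate from the low surrogate that immediately follows it.
function splitsSurrogatePair(s, i) {
  if (i <= 0 || i >= s.length) return false;
  var hi = s.charCodeAt(i - 1);
  var lo = s.charCodeAt(i);
  return hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF;
}

// The surrogate pair from Christian's example below.
var dci = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);

// A gatekeeper could reject or flag dci on the way in, and code that cuts
// strings could test the cut point before calling substring.
var flagged = hasSurrogates(dci);
var unsafeCut = splitsSurrogatePair(dci, 1);
```

A gatekeeper built this way only detects the problem; what to do with a flagged string (reject it, or pass it through and document the risk) is still an application decision.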
On Oct 22, 2008, at 1:20 AM, Christian Plesner Hansen wrote:

> Note also that you can't generally tell whether a program will behave
> correctly under UCS-2. For instance, consider this program:
>
>   var dci = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);
>   var dli = dci.toLowerCase();
>   print(dci == dli);
>
> (dci is a deseret capital I, represented by a surrogate pair). Under
> UCS-2 this program prints true, under UTF-16 it prints false.
> Programs like this cannot be detected reliably.
>
> On Wed, Oct 22, 2008 at 9:45 AM, Erik Corry <[EMAIL PROTECTED]> wrote:
>> It's worth remembering that if you put UTF-16 into a JS string and then get
>> the UTF-16 out again then you will not lose any data. In a sense V8 is
>> transparent to UTF-16. It's only when you manipulate the string in JS in
>> certain ways that you risk 'corruption'. For example if you use substring
>> to cut a string in the middle of a surrogate pair then the result will no
>> longer be valid UTF-16.
>>
>> On Wed, Oct 22, 2008 at 3:08 AM, Pete Gontier <[EMAIL PROTECTED]> wrote:
>>>
>>> On Oct 21, 2008, at 1:45 AM, Christian Plesner Hansen wrote:
>>>
>>> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can be
>>> expected to support UTF-16, even though it's not strictly required, and
>>> comments in the V8 headers suggest that strings are indeed UTF-16. However,
>>> in the real world, it turns out that JavaScript string functions and regular
>>> expressions are not required to support UTF-16, which can have surrogate
>>> pairs (multiple 16-bit quantities representing a single character). Because
>>> of this pre-existing condition, V8 is not in a position to do better, since
>>> this would break compatibility with other JavaScript engines.
>>>
>>> In most cases the spec tells us to treat strings as UCS-2, including most
>>> string operations like charAt and case conversion. This is not optional,
>>> handling surrogate pairs would actually be incorrect according to the spec.
>>> In a few cases (I can only think of 'eval' but there may be more) the spec
>>> says to treat strings as UTF-16. Again, this is not optional.
>>>
>>> As you say, for compatibility reasons we would be reluctant to switch any
>>> of the places we use UCS-2 to using UTF-16. However, for most operations I
>>> think the switch could be made without breaking any code on the web. For
>>> instance, JavaScriptCore uses UTF-16 for case conversion and it doesn't seem
>>> to be an issue.
>>>
>>> So now my question is whether people expect to be able to use/store UTF-16
>>> in JavaScript even though this cannot be expected to work reliably for
>>> anything beyond the simplest read/write cases. I'm pondering whether I'd be
>>> doing my customers (client developers) a favor by using iconv to convert all
>>> text to UCS-2 before handing it to V8. This would give me an opportunity to
>>> detect that the input characters cannot be converted to UCS-2 before they
>>> ever got into V8 and caused subtle problems, possibly much farther down the
>>> road when it would be difficult to figure them out.
>>>
>>> This is an application specific question, it's very hard to give a general
>>> answer. If your program depends on string operations being correct
>>> according to the unicode standard, for instance that surrogate pairs are
>>> converted correctly to upper and lower case, then you're in trouble if your
>>> program is written in JavaScript. However, most of the language and even
>>> many string operations are unaffected by this, and the operations that are
>>> affected still use a consistent and reliable model -- it is just not the
>>> same as the unicode model.
>>>
>>> Thanks for the insight and thanks in advance for tolerating my thinking
>>> out loud here.
>>>
>>> The app in question is an application server in early development. When I
>>> say "customers (client developers)", I'm referring to the future. Happily,
>>> I'm not concerned about a large body of existing code. As well, I don't
>>> think I need to be concerned about militant JavaScript activists demanding
>>> UTF-16 in the few cases it's allowed.
>>>
>>> So, on one hand, I may have an opportunity now to prevent some heart-ache
>>> and head-scratching, and I'm somewhat inclined to be a proactive paranoid
>>> gatekeeper and require every string coming in from the outside world to
>>> convert with full fidelity to UCS-2, even if there are some cases (such as
>>> 'eval') which would tolerate UTF-16.
>>>
>>> On the other hand, I'm not so crazy as to think I want to implement every
>>> bit of this application server myself, and there may well be script
>>> libraries written primarily for use within web browsers which I would like
>>> to incorporate -- or anyway make it possible to incorporate. I suppose if
>>> the only strings they ever see are UCS-2, then they will work just fine, but
>>> if they have features which depend on UTF-16, those will break or cause
>>> breakage. I bet such features are few and far between, but I can't know
>>> conclusively. Hmmm.
>>>
>>> I suppose one approach would be to use UCS-2 until someone complains. :-)
>>>
>>> Pete Gontier <http://pete.gontier.org/>
>>
>> --
>> Erik Corry, Software Engineer
>> Google Denmark ApS. CVR nr. 28 86 69 84
>> c/o Philip & Partners, 7 Vognmagergade, P.O. Box 2227, DK-1018 Copenhagen K,
>> Denmark.

– Pete Gontier <http://pete.gontier.org/>

--~--~---------~--~----~------------~-------~--~----~
v8-users mailing list
[email protected]
http://groups.google.com/group/v8-users
-~----------~----~----~----~------~----~------~--~---
