On Oct 22, 2008, at 12:45 AM, Erik Corry wrote:

> It's worth remembering that if you put UTF-16 into a JS string and
> then get the UTF-16 out again then you will not lose any data. In a
> sense V8 is transparent to UTF-16. It's only when you manipulate
> the string in JS in certain ways that you risk 'corruption'. For
> example if you use substring to cut a string in the middle of a
> surrogate pair then the result will no longer be valid UTF-16.
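For concreteness, here's a minimal sketch of the kind of 'corruption' Erik describes (the clef character is just an example; any supplementary-plane character behaves the same way):

    // U+1D11E (musical G clef) is outside the BMP, so it is stored as
    // the surrogate pair \uD834 \uDD1E and counts as two JS "characters".
    var clef = "\uD834\uDD1E";
    clef.length;                    // 2
    var chopped = clef.substring(0, 1);
    chopped;                        // "\uD834", a lone high surrogate; no longer valid UTF-16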
That's exactly the situation I'm pondering. On one hand, chopping a surrogate pair in half will create problems which will probably be difficult for most scripters to detect/comprehend/diagnose/handle, especially if the string gets passed around for a while and maybe combined with others before the problem shows any symptoms. On the other hand, denying savvy scripters the ability to store UTF-16 at all will probably frustrate some.

So far, when I import strings, there's an import object with a property representing the source encoding, and I've been assuming the destination encoding because I thought it should always be UTF-16. (Here's the example.) Perhaps I could add a property which specifies the destination encoding: if it's absent, assume UCS-2; if it's present, it can be either UCS-2 or UTF-16. That way, savvy scripters who really want to put UTF-16 into a JavaScript string have a way to do it, but the default behavior is to assume an encoding (UCS-2) which is guaranteed to be free of surrogate-pair subtleties.
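Roughly the call shapes I have in mind (importString, sourceEncoding, and destinationEncoding are placeholder names here, not the real property names):

    // Hypothetical sketch; importString and the property names are
    // placeholders, not the real API. Source data is elided.
    // Today: only the source encoding is given, and the destination
    // encoding is assumed (currently UTF-16; the proposal is UCS-2).
    importString({ sourceEncoding: "UTF-8" });

    // Proposed: an optional destinationEncoding property. Absent means
    // UCS-2; "UTF-16" is an explicit opt-in for scripters who accept
    // the surrogate-pair risk.
    importString({ sourceEncoding: "UTF-8", destinationEncoding: "UTF-16" });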
– Pete Gontier <http://pete.gontier.org/>