On Oct 22, 2008, at 12:45 AM, Erik Corry wrote:

> It's worth remembering that if you put UTF-16 into a JS string and  
> then get the UTF-16 out again then you will not lose any data.  In a  
> sense V8 is transparent to UTF-16.  It's only when you manipulate  
> the string in JS in certain ways that you risk 'corruption'.  For  
> example if you use substring to cut a string in the middle of a  
> surrogate pair then the result will no longer be valid UTF-16.


That's exactly the situation I'm pondering. On one hand, chopping a
surrogate pair in half will create problems that will probably be
difficult for most scripters to detect/comprehend/diagnose/handle,
especially if the string gets passed around for a while and maybe
combined with others before the problem shows any symptoms. On the
other hand, denying savvy scripters the ability to store UTF-16 at all
will probably frustrate some.
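
For instance (just a sketch; the particular character is only an
illustration, not anything from real data):

    // U+1D11E MUSICAL SYMBOL G CLEF is stored as the surrogate
    // pair \uD834\uDD1E, so the string's length is 2, not 1.
    var clef = "\uD834\uDD1E";
    var broken = clef.substring(0, 1);  // lone high surrogate \uD834
    // 'broken' is no longer valid UTF-16, and concatenating it with
    // other strings just carries the damage further from its source.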

So far, when I import strings, the import object has a property
specifying the source encoding, and I've been leaving the destination
encoding implicit because I thought it should always be UTF-16.
(Here's the example.) Perhaps I could add a property which specifies
the destination encoding: if it's absent, assume UCS-2; if it's
present, it can be either UCS-2 or UTF-16. That way, savvy scripters
who really want to put UTF-16 into a JavaScript string have a way to
do it, but the default behavior is to assume UCS-2, which is
guaranteed to be free of surrogate-pair subtleties.
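
Roughly, I'm picturing something like this (a sketch only; the
function and property names are placeholders, not the real names from
my example):

    host.importString({
      source: externalBytes,           // bytes coming from outside JS
      sourceEncoding: "UTF-8",         // required, as it is today
      destinationEncoding: "UTF-16"    // optional; omitted means UCS-2
    });

Savvy scripters opt in to UTF-16 explicitly; everyone else gets UCS-2
and never sees a split surrogate pair come out of this path.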


– Pete Gontier <http://pete.gontier.org/>



