[v8-users] Re: some minor string confusion :)

Pete Gontier Sun, 26 Oct 2008 11:19:27 -0700

Too true. That's why I mentioned, though, that I do not have an  
existing body of code to support. You guys must work within Chrome and  
Chrome must work with squillions of existing web pages. So I  
understand why this would be a big consideration for you, but I  
suspect/hope that I have an opportunity here to document JavaScript's  
odd hybrid encoding approach to Unicode and steer people toward UCS-2  
unless they really need UTF-16, in which case they may need to do  
extra work, or at least be very careful to avoid logic that could  
cost them a lot of debugging time.
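
A rough sketch of the kind of check that would steer input toward UCS-2, in plain JavaScript. The helper names isUcs2Safe and isWellFormedUtf16 are hypothetical (they are not part of V8 or any library), and the sketch assumes that "UCS-2 safe" simply means "contains no surrogate code units":

```javascript
// Hypothetical helper: true when the string contains no UTF-16 surrogate
// code units, so every character fits in one 16-bit unit (UCS-2 safe).
function isUcs2Safe(str) {
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      return false;
    }
  }
  return true;
}

// Hypothetical helper: true only when every high surrogate is immediately
// followed by a low surrogate and no low surrogate stands alone.
function isWellFormedUtf16(str) {
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // Treat "end of string" as a missing low surrogate.
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) {
        return false;
      }
      i++; // skip the low half of the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return false;
    }
  }
  return true;
}

// The surrogate pair from the example quoted further down the thread.
var dci = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);

isUcs2Safe("plain text");               // true
isUcs2Safe(dci);                        // false
isWellFormedUtf16(dci);                 // true
isWellFormedUtf16(dci.substring(0, 1)); // false: substring split the pair
```

The first function is the strict gatekeeper; the second only shows how an ordinary substring call can turn valid UTF-16 into something invalid.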


On Oct 22, 2008, at 1:20 AM, Christian Plesner Hansen wrote:

> Note also that you can't generally tell whether a program will behave
> correctly under UCS-2.  For instance, consider this program:
>
> var dci = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);
> var dli = dci.toLowerCase();
> print(dci == dli);
>
> (dci is Deseret capital letter long I, U+10400, represented by a
> surrogate pair.)  Under UCS-2 this program prints true; under UTF-16 it
> prints false.  Programs like this cannot be detected reliably.
>
> On Wed, Oct 22, 2008 at 9:45 AM, Erik Corry <[EMAIL PROTECTED]> wrote:
>> It's worth remembering that if you put UTF-16 into a JS string and then
>> get the UTF-16 out again then you will not lose any data.  In a sense V8
>> is transparent to UTF-16.  It's only when you manipulate the string in
>> JS in certain ways that you risk 'corruption'.  For example, if you use
>> substring to cut a string in the middle of a surrogate pair then the
>> result will no longer be valid UTF-16.
>>
>> On Wed, Oct 22, 2008 at 3:08 AM, Pete Gontier <[EMAIL PROTECTED]> wrote:
>>>
>>> On Oct 21, 2008, at 1:45 AM, Christian Plesner Hansen wrote:
>>>
>>> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can
>>> be expected to support UTF-16, even though it's not strictly required,
>>> and comments in the V8 headers suggest that strings are indeed UTF-16.
>>> However, in the real world, it turns out that JavaScript string
>>> functions and regular expressions are not required to support UTF-16,
>>> which can have surrogate pairs (multiple 16-bit quantities representing
>>> a single character). Because of this pre-existing condition, V8 is not
>>> in a position to do better, since this would break compatibility with
>>> other JavaScript engines.
>>>
>>> In most cases the spec tells us to treat strings as UCS-2, including
>>> most string operations like charAt and case conversion.  This is not
>>> optional; handling surrogate pairs would actually be incorrect
>>> according to the spec.  In a few cases (I can only think of 'eval' but
>>> there may be more) the spec says to treat strings as UTF-16.  Again,
>>> this is not optional.
>>>
>>> As you say, for compatibility reasons we would be reluctant to switch
>>> any of the places we use UCS-2 to using UTF-16.  However, for most
>>> operations I think the switch could be made without breaking any code
>>> on the web.  For instance, JavaScriptCore uses UTF-16 for case
>>> conversion and it doesn't seem to be an issue.
>>>
>>> So now my question is whether people expect to be able to use/store
>>> UTF-16 in JavaScript even though this cannot be expected to work
>>> reliably for anything beyond the simplest read/write cases. I'm
>>> pondering whether I'd be doing my customers (client developers) a favor
>>> by using iconv to convert all text to UCS-2 before handing it to V8.
>>> This would give me an opportunity to detect that the input characters
>>> cannot be converted to UCS-2 before they ever got into V8 and caused
>>> subtle problems, possibly much farther down the road when it would be
>>> difficult to figure them out.
>>>
>>> This is an application-specific question; it's very hard to give a
>>> general answer.  If your program depends on string operations being
>>> correct according to the Unicode standard, for instance that surrogate
>>> pairs are converted correctly to upper and lower case, then you're in
>>> trouble if your program is written in JavaScript.  However, most of the
>>> language and even many string operations are unaffected by this, and
>>> the operations that are affected still use a consistent and reliable
>>> model -- it is just not the same as the Unicode model.
>>>
>>> Thanks for the insight and thanks in advance for tolerating my thinking
>>> out loud here.
>>>
>>> The app in question is an application server in early development. When
>>> I say "customers (client developers)", I'm referring to the future.
>>> Happily, I'm not concerned about a large body of existing code. As
>>> well, I don't think I need to be concerned about militant JavaScript
>>> activists demanding UTF-16 in the few cases it's allowed.
>>>
>>> So, on one hand, I may have an opportunity now to prevent some
>>> heart-ache and head-scratching, and I'm somewhat inclined to be a
>>> proactive paranoid gatekeeper and require every string coming in from
>>> the outside world to convert with full fidelity to UCS-2, even if there
>>> are some cases (such as 'eval') which would tolerate UTF-16.
>>>
>>> On the other hand, I'm not so crazy as to think I want to implement
>>> every bit of this application server myself, and there may well be
>>> script libraries written primarily for use within web browsers which I
>>> would like to incorporate -- or anyway make it possible to incorporate.
>>> I suppose if the only strings they ever see are UCS-2, then they will
>>> work just fine, but if they have features which depend on UTF-16, those
>>> will break or cause breakage. I bet such features are few and far
>>> between, but I can't know conclusively. Hmmm.
>>>
>>> I suppose one approach would be to use UCS-2 until someone complains. :-)
>>>
>>> Pete Gontier <http://pete.gontier.org/>
>>>
>>>
>>
>>
>>
>> --
>> Erik Corry, Software Engineer
>> Google Denmark ApS.  CVR nr. 28 86 69 84
>> c/o Philip & Partners, 7 Vognmagergade, P.O. Box 2227, DK-1018  
>> Copenhagen K,
>> Denmark.
>>


– Pete Gontier <http://pete.gontier.org/>
