[v8-users] Re: some minor string confusion :)

Pete Gontier Tue, 21 Oct 2008 18:09:10 -0700

On Oct 21, 2008, at 1:45 AM, Christian Plesner Hansen wrote:

>> An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript  
>> can be expected to support UTF-16, even though it's not strictly  
>> required, and comments in the V8 headers suggest that strings are  
>> indeed UTF-16. However, in the real world, it turns out that  
>> JavaScript string functions and regular expressions are not  
>> required to support UTF-16, which can have surrogate pairs  
>> (multiple 16-bit quantities representing a single character).  
>> Because of this pre-existing condition, V8 is not in a position to  
>> do better, since this would break compatibility with other  
>> JavaScript engines.


> In most cases the spec tells us to treat strings as UCS-2, including  
> most string operations like charAt and case conversion.  This is not  
> optional; handling surrogate pairs would actually be incorrect  
> according to the spec.  In a few cases (I can only think of 'eval'  
> but there may be more) the spec says to treat strings as UTF-16.   
> Again, this is not optional.

> As you say, for compatibility reasons we would be reluctant to  
> switch any of the places we use UCS-2 to using UTF-16.  However, for  
> most operations I think the switch could be made without breaking  
> any code on the web.  For instance, JavaScriptCore uses UTF-16 for  
> case conversion and it doesn't seem to be an issue.

>> So now my question is whether people expect to be able to use/store  
>> UTF-16 in JavaScript even though this cannot be expected to work  
>> reliably for anything beyond the simplest read/write cases. I'm  
>> pondering whether I'd be doing my customers (client developers) a  
>> favor by using iconv to convert all text to UCS-2 before handing it  
>> to V8. This would give me an opportunity to detect that the input  
>> characters cannot be converted to UCS-2 before they ever got into  
>> V8 and caused subtle problems, possibly much farther down the road  
>> when it would be difficult to figure them out.

> This is an application-specific question; it's very hard to give a  
> general answer.  If your program depends on string operations being  
> correct according to the Unicode standard, for instance that  
> surrogate pairs are converted correctly to upper and lower case,  
> then you're in trouble if your program is written in JavaScript.   
> However, most of the language and even many string operations are  
> unaffected by this, and the operations that are affected still use a  
> consistent and reliable model -- it is just not the same as the  
> Unicode model.
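
To make the "consistent and reliable model" concrete, here is a minimal JavaScript sketch of how a character outside the Basic Multilingual Plane looks to the string operations discussed above. The variable names are only illustrative; the character chosen (U+1D11E, a musical G clef) is just one example of a character that needs a surrogate pair.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so a JavaScript
// string holds it as a surrogate pair: two 16-bit code units.
var clef = "\uD834\uDD1E";

var clefLength = clef.length;        // counts 16-bit code units, not characters
var firstUnit = clef.charCodeAt(0);  // the high (lead) surrogate
var secondUnit = clef.charCodeAt(1); // the low (trail) surrogate
var half = clef.charAt(0);           // a lone surrogate, not a whole character

// A string made only of BMP characters is the same under UCS-2 and UTF-16.
var bmp = "caf\u00E9";
var bmpLength = bmp.length;
```

Here `clefLength` is 2 even though the string holds a single character, while `bmpLength` matches the character count. That is the UCS-2 view: each 16-bit unit is treated as its own character.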

Thanks for the insight and thanks in advance for tolerating my  
thinking out loud here.

The app in question is an application server in early development.  
When I say "customers (client developers)", I'm referring to the  
future. Happily, I'm not concerned about a large body of existing  
code. As well, I don't think I need to be concerned about militant  
JavaScript activists demanding UTF-16 in the few cases it's allowed.

So, on one hand, I may have an opportunity now to prevent some  
heartache and head-scratching, and I'm somewhat inclined to be a  
proactive paranoid gatekeeper and require every string coming in from  
the outside world to convert with full fidelity to UCS-2, even if  
there are some cases (such as 'eval') which would tolerate UTF-16.
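
As a rough illustration of that gatekeeper idea, the check amounts to rejecting any text that contains a surrogate code unit. The sketch below is written in JavaScript only to show the logic; in the server itself the check would presumably happen in the host code (for example during the iconv conversion) before the text ever reaches V8. Both function names are hypothetical.

```javascript
// Hypothetical helper: returns the index of the first surrogate code unit
// (0xD800 through 0xDFFF), or -1 if every code unit is a whole BMP character.
function firstSurrogateIndex(s) {
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      return i;
    }
  }
  return -1;
}

// Hypothetical gatekeeper: pass UCS-2-safe text through unchanged and
// reject anything that would need UTF-16 surrogate pairs.
function requireUcs2(s) {
  var i = firstSurrogateIndex(s);
  if (i !== -1) {
    throw new Error("input is not UCS-2: surrogate code unit at index " + i);
  }
  return s;
}
```

Failing early like this reports the offending position at the boundary, rather than letting a split surrogate pair surface later in some unrelated string operation.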

On the other hand, I'm not so crazy as to think I want to implement  
every bit of this application server myself, and there may well be  
script libraries written primarily for use within web browsers which I  
would like to incorporate -- or anyway make it possible to  
incorporate. I suppose if the only strings they ever see are UCS-2,  
then they will work just fine, but if they have features which depend  
on UTF-16, those will break or cause breakage. I bet such features are  
few and far between, but I can't know conclusively. Hmmm.

I suppose one approach would be to use UCS-2 until someone  
complains. :-)


Pete Gontier <http://pete.gontier.org/>


