[v8-users] Re: some minor string confusion :)

Pete Gontier Mon, 20 Oct 2008 16:14:06 -0700

My understanding of this stuff recently deepened a little bit after I  
read up on the behavior to be expected from strings and regular  
expressions. Consequently, a potential course of action has presented  
itself to me. I am not the world's foremost expert on these topics, so  
feel free to correct me on any aspect of the below.


An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can  
be expected to support UTF-16, even though it's not strictly required,  
and comments in the V8 headers suggest that strings are indeed UTF-16.  
However, in the real world, it turns out that JavaScript string
functions and regular expressions operate on individual 16-bit units
and are not required to understand UTF-16 surrogate pairs (two 16-bit
units which together represent a single character). Because of this
pre-existing condition, V8 is not in a position to do better, since
this would break compatibility with other JavaScript engines.

The net effect here is that V8 (and all other JavaScript engines)  
actually supports UCS-2, which is a proper subset of UTF-16. (Both  
encodings support the Basic Multilingual Plane of Unicode.) This is  
mildly bad news, but I am happy to understand it, and things could be  
a lot worse. (I imagine ECMA knows about this problem and is thinking  
about a solution.)
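
To make the surrogate pair point concrete, here is a minimal sketch of
how a single code point turns into 16-bit units. The function name
EncodeUtf16 is made up for illustration and is not part of V8 or any
library; the arithmetic is the standard UTF-16 encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Code points in the Basic Multilingual Plane take one unit (this is the
// part UCS-2 can represent); anything above U+FFFF takes two units.
std::vector<uint16_t> EncodeUtf16(uint32_t cp) {
  assert(cp <= 0x10FFFF);              // beyond the Unicode range
  assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters
  if (cp < 0x10000) {
    return {static_cast<uint16_t>(cp)};
  }
  const uint32_t v = cp - 0x10000;  // 20 bits, split into two 10-bit halves
  return {static_cast<uint16_t>(0xD800 + (v >> 10)),     // high surrogate
          static_cast<uint16_t>(0xDC00 + (v & 0x3FF))};  // low surrogate
}
```

For example, U+0041 comes back as one unit, while U+1D11E comes back as
the two units 0xD834 0xDD1E. Code that counts or indexes 16-bit units,
as the JavaScript string functions do, would see that one character as
two.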

So now my question is whether people expect to be able to use/store  
UTF-16 in JavaScript even though this cannot be expected to work  
reliably for anything beyond the simplest read/write cases. I'm  
pondering whether I'd be doing my customers (client developers) a  
favor by using iconv to convert all text to UCS-2 before handing it to  
V8. This would give me an opportunity to detect that the input  
characters cannot be converted to UCS-2 before they ever got into V8  
and caused subtle problems, possibly much farther down the road when  
it would be difficult to figure them out.
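
If the input is already UTF-16, the detection step does not strictly
need iconv. Below is a rough sketch of the kind of check I have in
mind, assuming the text is available as a buffer of 16-bit units. The
names FirstNonUcs2Unit and IsUcs2Safe are hypothetical, and this is
only one way to do it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: return the index of the first surrogate code unit
// in a UTF-16 buffer, or n if every unit is a plain BMP value.
std::size_t FirstNonUcs2Unit(const uint16_t* units, std::size_t n) {
  assert(units != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i) {
    // 0xD800..0xDFFF is reserved for surrogates, which UCS-2 cannot express.
    if (units[i] >= 0xD800 && units[i] <= 0xDFFF) return i;
  }
  return n;
}

// True if the buffer can be treated as UCS-2 without losing anything.
bool IsUcs2Safe(const uint16_t* units, std::size_t n) {
  return FirstNonUcs2Unit(units, n) == n;
}
```

A host application could run this on text before creating a V8 string
and report the offending index to the client developer, rather than
letting the surrogate pair through.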


Pete Gontier <http://pete.gontier.org/>



On Oct 5, 2008, at 8:07 PM, Pete Gontier wrote:

> I was spelunking the header just now and ran across some comments  
> which made specific reference to UTF-16, so that's good. It would  
> still be useful to know which de/composition to expect. It might  
> seem needlessly specific, but because others have done it, it's  
> useful to know.
>
> Pete Gontier <http://pete.gontier.org/>
>
>
>
> On Oct 4, 2008, at 10:53 AM, Pete Gontier wrote:
>
>> It sounds as if I didn't ask my question very well. Let me try  
>> again. I'm going to explain some things as if you didn't know them  
>> even though you obviously do just to make it clear what I'm asking  
>> about.
>>
>> Every string has an encoding: UCS-2, ASCII, UTF-8, Shift JIS,  
>> UTF-16, etc. Unicode strings are also either composed or decomposed  
>> in one of several ways.
>>
>> ECMA-262 4.3.16 doesn't specify an encoding for JavaScript strings.  
>> It specifies that strings are arrays of 16-bit integers. It doesn't  
>> specify semantics for those integers. It says each of these  
>> integers is "usually" UTF-16 (without suggesting a de/composition)  
>> but doesn't specify it.
>>
>> Obviously, V8 is free to do whatever it likes with strings  
>> internally in order to get its job done. However, a couple of  
>> questions remain from an interface standpoint:
>>
>> What encoding and de/composition can JavaScript programs expect? (I  
>> expect this will be dictated by the expectations of programs such  
>> as Gmail.)
>>
>> What encoding and de/composition can clients of v8::String::Write,  
>> v8::String::ExternalStringResource, and v8::String::Value expect?  
>> (I expect this will be dictated by the expectations of programs  
>> such as Chrome.)
>>
>> I am not a Unicode expert, so I recognize these questions may seem  
>> silly on some level.
>>
>>
>> Pete Gontier <http://pete.gontier.org/>
>>
>>
>>
>> On Oct 4, 2008, at 6:27 AM, Søren Gjesse wrote:
>>
>>> Inside V8 there are a number of different string representations.
>>> The basic ones are the ASCII representation (AsciiString) and the
>>> two-byte representation (TwoByteString), where the first is used
>>> when all characters are ASCII and therefore only one byte is
>>> required to store each character. Besides that, V8 has concatenated
>>> strings (ConsString) and string slices (SlicedString). A
>>> concatenated string points to two other strings which have been
>>> concatenated, but the result is not materialized, whereas a string
>>> slice points to a part of an existing string. V8 tries to make the
>>> best choice when making new strings, and there are a number of
>>> rules to materialize (flatten) concatenated strings when certain
>>> operations are performed. Finally, there are also external strings
>>> in ASCII and two-byte variants (ExternalAsciiString and
>>> ExternalTwoByteString). These are strings which are not present in
>>> the V8 heap but are references to strings in C++ land added through
>>> the API. In Chrome, external strings are used when adding the
>>> JavaScript source code from web pages to V8 without making an
>>> additional copy.
>>>
>>> Regards,
>>> Søren
>>>
>>> On Sat, Oct 4, 2008 at 3:29 AM, Pete Gontier <[EMAIL PROTECTED]>  
>>> wrote:
>>> ECMA-262 4.3.16 allows a fair amount of encoding flexibility.
>>>
>>> Has V8 committed to any particular encoding?
>>>
>>>
>>> Pete Gontier <http://pete.gontier.org/>
>>>
>>>
>>>
>>> On Oct 2, 2008, at 11:59 PM, Søren Gjesse wrote:
>>>
>>>> There is only one String type in V8, which is v8::String. You can
>>>> create a new String in a number of ways, with v8::String::New
>>>> being the most commonly used. The classes v8::String::Utf8Value
>>>> and v8::String::Value (and v8::String::AsciiValue, which is mainly
>>>> for testing) are used to pull out the string as a char* or
>>>> uint16_t* to be used in C++, e.g.:
>>>>
>>>>   v8::Handle<v8::String> str = v8::String::New("print");
>>>>   v8::String::Utf8Value s(str);
>>>>   printf("%s", *s);
>>>>
>>>> Note that v8::String represents the string value (ECMA-262  
>>>> 4.3.16). To create a string object (ECMA-262 4.3.18) use  
>>>> NewInstance on the String function.
>>>>
>>>> Regards,
>>>> Søren
>>>>
>>>>
>>>> On Thu, Oct 2, 2008 at 11:00 PM, ondras <[EMAIL PROTECTED]>  
>>>> wrote:
>>>>
>>>> Hi again,
>>>>
>>>> I have some trouble understanding all those String types in V8.
>>>> What exactly is the purpose of, and the difference between,
>>>> v8::String::New, v8::String::AsciiValue and v8::String::Utf8Value?
>>>> How should I use these, and when?
>>>>
>>>> Thanks for clarification,
>>>> Ondrej


--~--~---------~--~----~------------~-------~--~----~
v8-users mailing list
[email protected]
http://groups.google.com/group/v8-users
-~----------~----~----~----~------~----~------~--~---
