My understanding of this recently deepened a bit after I read up on how
strings and regular expressions are expected to behave. As a result, a
possible course of action has occurred to me. I am not the world's
foremost expert on these topics, so feel free to correct me on any of
the below.
An optimistic reading of ECMA-262 4.3.16 suggests that JavaScript can be
expected to support UTF-16, even though it's not strictly required, and
comments in the V8 headers suggest that strings are indeed UTF-16.
However, in the real world, it turns out that JavaScript string
functions and regular expressions are not required to support UTF-16,
which can have surrogate pairs (two 16-bit code units representing a
single character). Because of this pre-existing condition, V8 is not in
a position to do better, since doing so would break compatibility with
other JavaScript engines.

The net effect is that V8 (and all other JavaScript engines) actually
supports UCS-2, which is a proper subset of UTF-16. (Both encodings
support the Basic Multilingual Plane of Unicode.) This is mildly bad
news, but I am happy to understand it, and things could be a lot worse.
(I imagine ECMA knows about this problem and is thinking about a
solution.)

So now my question is whether people expect to be able to use/store
UTF-16 in JavaScript even though this cannot be expected to work
reliably for anything beyond the simplest read/write cases.

I'm pondering whether I'd be doing my customers (client developers) a
favor by using iconv to convert all text to UCS-2 before handing it to
V8. This would give me an opportunity to detect input characters that
cannot be converted to UCS-2 before they ever get into V8 and cause
subtle problems, possibly much farther down the road, when they would
be difficult to diagnose.

Pete Gontier <http://pete.gontier.org/>

On Oct 5, 2008, at 8:07 PM, Pete Gontier wrote:

> I was spelunking the header just now and ran across some comments
> which made specific reference to UTF-16, so that's good. It would
> still be useful to know which de/composition to expect. It might
> seem needlessly specific, but because others have done it, it's
> useful to know.
>
> Pete Gontier <http://pete.gontier.org/>
>
> On Oct 4, 2008, at 10:53 AM, Pete Gontier wrote:
>
>> It sounds as if I didn't ask my question very well. Let me try
>> again. I'm going to explain some things as if you didn't know them
>> even though you obviously do just to make it clear what I'm asking
>> about.
>>
>> Every string has an encoding: UCS-2, ASCII, UTF-8, Shift JIS,
>> UTF-16, etc. Unicode strings are also either composed or decomposed
>> in one of several ways.
>>
>> ECMA-262 4.3.16 doesn't specify an encoding for JavaScript strings.
>> It specifies that strings are arrays of 16-bit integers. It doesn't
>> specify semantics for those integers. It says each of these
>> integers is "usually" UTF-16 (without suggesting a de/composition)
>> but doesn't specify it.
>>
>> Obviously, V8 is free to do whatever it likes with strings
>> internally in order to get its job done. However, a couple of
>> questions remain from an interface standpoint:
>>
>> What encoding and de/composition can JavaScript programs expect? (I
>> expect this will be dictated by the expectations of programs such
>> as Gmail.)
>>
>> What encoding and de/composition can clients of v8::String::Write,
>> v8::String::ExternalStringResource, and v8::String::Value expect?
>> (I expect this will be dictated by the expectations of programs
>> such as Chrome.)
>>
>> I am not a Unicode expert, so I recognize these questions may seem
>> silly on some level.
>>
>> Pete Gontier <http://pete.gontier.org/>
>>
>> On Oct 4, 2008, at 6:27 AM, Søren Gjesse wrote:
>>
>>> Inside V8 there are a number of different string representations.
>>> The basic ones are ASCII representation (AsciiString) and two-byte
>>> representation (TwoByteString), where the first is used when all
>>> characters are ASCII and therefore only one byte is required to
>>> store each character. Besides that V8 has concatenated strings
>>> (ConsString) and string slices (SlicedString).
>>> Concatenated strings point to two other strings which have been
>>> concatenated, but the concatenated string is not materialized,
>>> whereas a string slice points to a part of an existing string. V8
>>> tries to make the best choice when making new strings and there
>>> are a number of rules to materialize (flatten) concatenated
>>> strings when certain operations are performed. Finally there are
>>> also external strings in ASCII and two-byte variants
>>> (ExternalAsciiString and ExternalTwoByteString); these are strings
>>> which are not present in the V8 heap but are references to strings
>>> in C++ land added through the API. In Chrome external strings are
>>> used when adding the JavaScript source code from web pages to V8
>>> without making an additional copy.
>>>
>>> Regards,
>>> Søren
>>>
>>> On Sat, Oct 4, 2008 at 3:29 AM, Pete Gontier <[EMAIL PROTECTED]>
>>> wrote:
>>> ECMA-262 4.3.16 allows a fair amount of encoding flexibility.
>>>
>>> Has V8 committed to any particular encoding?
>>>
>>> Pete Gontier <http://pete.gontier.org/>
>>>
>>> On Oct 2, 2008, at 11:59 PM, Søren Gjesse wrote:
>>>
>>>> There is only one String type in V8 which is v8::String. You can
>>>> create a new String in a number of ways, with v8::String::New
>>>> most commonly used. The classes v8::String::Utf8Value and
>>>> v8::String::Value (and v8::String::AsciiValue which is mainly for
>>>> testing) are used to pull out the string as a char* or uint16_t*
>>>> to be used in C++, e.g.:
>>>>
>>>> v8::Handle<v8::String> str = v8::String::New("print");
>>>> v8::String::Utf8Value s(str);
>>>> printf("%s", *s);
>>>>
>>>> Note that v8::String represents the string value (ECMA-262
>>>> 4.3.16). To create a string object (ECMA-262 4.3.18) use
>>>> NewInstance on the String function.
>>>>
>>>> Regards,
>>>> Søren
>>>>
>>>> On Thu, Oct 2, 2008 at 11:00 PM, ondras <[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>> Hi again,
>>>>
>>>> I have some troubles understanding all those String types in V8.
>>>> What exactly is the purpose and difference between
>>>> v8::String::New, v8::String::AsciiValue and
>>>> v8::String::Utf8Value? How should I use these and when?
>>>>
>>>> Thanks for clarification,
>>>> Ondrej

--
v8-users mailing list
http://groups.google.com/group/v8-users
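
P.S. To make the UCS-2 check from the top of this message a bit more
concrete, here is a minimal sketch of how input could be screened for
surrogates before it is handed to V8. It assumes the text has already
been decoded to UTF-16 code units. FirstSurrogateIndex is a hypothetical
helper of my own, not part of V8 or iconv, and it uses a plain scan
rather than the iconv conversion I mentioned, so treat it as an
illustration only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration, not a V8 API).
// Returns the index of the first UTF-16 surrogate code unit in `text`,
// or std::u16string::npos if every code unit stands alone, which means
// the text is also valid UCS-2.
std::size_t FirstSurrogateIndex(const std::u16string& text) {
  for (std::size_t i = 0; i < text.size(); ++i) {
    const char16_t unit = text[i];
    // UTF-16 reserves 0xD800..0xDFFF for surrogates; a character outside
    // the Basic Multilingual Plane is stored as a pair of them.
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      return i;
    }
  }
  return std::u16string::npos;
}
```

For example, FirstSurrogateIndex(u"a\U0001D11E") returns 1, because
U+1D11E lies outside the BMP and is stored as the surrogate pair
0xD834 0xDD1E. A caller could reject or report such input up front
instead of letting it cause trouble later.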
