> Hmm. Now here's some C++ source code (syntax colored as
> Philippe suggests, to imply that the text editor understands
> C++ at least well enough to color it)
>
>     int n = wcslen(L"café");
>
> (That's int n = wcslen(L"café"); for those without HTML email)
>
> The L prefix on a string literal makes it a wide-character
> string, and wcslen() is simply a wide-character version of
> strlen(). (There is no guarantee that "wide character" means
> "Unicode character", but let's just assume that it does, for
> the moment).
Even assuming that you can assume that "wide characters" are Unicode,
you have not yet assumed in what kind of UTF they are. (Don't assume
I am deliberately making puns :-)

The only thing that the C(++) standards say about type "wchar_t" is
that it is not smaller than type "char", so a "wide character" could
well be a byte, and a "wide character string" could well be UTF-8, or
even ASCII.

> So, should n equal four or five?

Why not six? If, in our C(++) compiler, type "wchar_t" is an alias
for "char", and "wide character strings" are encoded in UTF-8, and
the "é" is decomposed, then n will be equal to 6.

> The answer would appear to depend on whether or not the
> source file was saved in NFC or NFD format.

The answer is:

    int n = wcslen(L"café");

That's why you go to the trouble of calling the "wcslen" library
function rather than assuming a hard-coded value such as:

    int n = 4;    // the length of string "café"

> There is more to consider than just how and whether a text
> editor normalizes.

Whatever the editor does, what if the *compiler* then normalizes it?
The source file and the compiled object file are not necessarily in
the same encoding and/or normalization. A certain compiler could
accept a certain range of input encodings (perhaps declared with a
command-line parameter) and convert them all into a certain internal
representation in the compiled object file (e.g., Unicode expressed
in a particular UTF and with a particular normalization).

That's why library functions such as "strlen" or "wcslen" exist. You
don't need to worry about what these functions will return in a
particular compiler or environment, as long as the following code is
guaranteed to work:

    const wchar_t * myText = L"café";

    /* +1 for the terminating L'\0' */
    wchar_t * myBuffer = (wchar_t *) malloc(sizeof(wchar_t) * (wcslen(myText) + 1));

    if (myBuffer != NULL)
    {
        wcscpy(myBuffer, myText);
    }

> If a text editor is capable of dealing with Unicode text,
> perhaps it should also be able to explicitly DISPLAY the
> actual composition form of every glyph.

Again, this is neither possible nor desirable, because a text editor
is not supposed to know how the compiler (or its runtime libraries)
will transform string literals.

> The question I posed in the previous paragraph should
> ideally be obvious by sight - if you see four characters,
> there are four characters; if you see five characters, there
> are five characters.

Provided that you can define what a "character" is... After a few
years reading this mailing list, I haven't seen a single acceptable
definition of "character". Moreover, I have come to the impression
that it is totally irrelevant to have such a definition:

- as an end user, I am interested in a higher-level kind of objects
  (let's call them "graphemes", i.e. those things I see on the screen
  and can interact with using my mouse);

- as a programmer, I am interested in a lower-level kind of objects
  (let's call them "encoding units", i.e. those things that I count
  when I have to allocate memory for a string, or the like).

The term "character" is in a sort of conceptual limbo which makes it
pretty useless for everybody, IMHO.

_ Marco
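
P.S. A minimal sketch of the counting question above (the byte values
are mine; it assumes a compiler whose narrow string literals are plain
bytes, so UTF-8 passes through untouched):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "café" with a precomposed é (U+00E9 = 0xC3 0xA9 in UTF-8) */
        const char * nfc = "caf\xC3\xA9";

        /* "café" with a decomposed é (e + U+0301 = 0xCC 0x81 in UTF-8) */
        const char * nfd = "cafe\xCC\x81";

        /* prints "5 6": the same text for the reader, but a different
           number of encoding units for the programmer */
        printf("%u %u\n", (unsigned) strlen(nfc), (unsigned) strlen(nfd));

        return 0;
    }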