> Hmm. Now here's some C++ source code (syntax colored as 
> Philippe suggests, to imply that the text editor understands 
> C++ at least well enough to color it)
> 
> int n = wcslen(L"café");
> 
> (That's int n = wcslen(L"café"); for those without HTML email)
> 
> The L prefix on a string literal makes it a wide-character 
> string, and wcslen() is simply a wide-character version of 
> strlen(). (There is no guarantee that "wide character" means 
> "Unicode character", but let's just assume that it does, for 
> the moment).

Even assuming that you can assume that "wide characters" are Unicode, you
have not yet assumed which UTF they are in. (Don't assume that I am
deliberately making puns :-)

The only thing that the C(++) standards say about type "wchar_t" is that it
is not smaller than type "char", so a "wide character" could well be a byte,
and a "wide character string" could well be UTF-8, or even ASCII.

> So, should n equal four or five?

Why not six?

If, in our C(++) compiler, type "wchar_t" is an alias for "char", and "wide
character strings" are encoded in UTF-8, and the "é" is decomposed, then n
will be equal to 6.

> The answer would appear to depend on whether or not the
> source file was saved in NFC or NFD format.

The answer is:

        int n = wcslen(L"café");

That's why you take the trouble of calling the "wcslen" library function
rather than assuming a hard-coded value such as:

        int n = 4;      // the length of string "café"

> There is more to consider than just how and whether a text 
> editor normalizes.

Whatever the editor does, what if then the *compiler* normalizes it?

The source file and the compiled object file are not necessarily in the same
encoding and/or normalization.

A certain compiler could accept a range of input encodings (perhaps
declared with a command-line parameter) and convert them all into a single
internal representation in the compiled object file (e.g., Unicode expressed
in a particular UTF and with a particular normalization).

That's why library functions such as "strlen" or "wcslen" exist. You don't
need to worry about what these functions will return in a particular
compiler or environment, as long as the following code is guaranteed to work:

        const wchar_t * myText = L"café";
        /* cast keeps this valid as C++ too, where void * does not
           convert implicitly */
        wchar_t * myBuffer = (wchar_t *) malloc(sizeof(wchar_t) * (wcslen(myText) + 1));
        if (myBuffer != NULL)
        {
                wcscpy(myBuffer, myText);
        }

> If a text editor is capable of dealing with Unicode text,
> perhaps it should also be able to explicitly DISPLAY the
> actual composition form of every glyph.

Again, this is neither possible nor desirable, because a text editor is not
supposed to know how the compiler (or its runtime libraries) will transform
string literals.

> The question I posed in the previous paragraph should 
> ideally be obvious by sight - if you see four characters, 
> there are four characters; if you see five characters, there 
> are five characters.

Provided that you can define what a "character" is... After a few years
reading this mailing list, I haven't seen a single acceptable definition of
"character".

Moreover, I have formed the impression that such a definition is totally
irrelevant:

- as an end user, I am interested in a higher-level kind of object (let's
call them "graphemes", i.e. those things that I see on the screen and can
interact with using my mouse);

- as a programmer, I am interested in a lower-level kind of object (let's
call them "encoding units", i.e. those things that I count when I have to
allocate memory for a string, or the like).

The term "character" is in a sort of conceptual limbo which makes it pretty
useless for everybody, IMHO.

_ Marco
