On 09/12/2003 07:00, Arcane Jill wrote:
Hmm. Now here's some C++ source code (syntax colored as Philippe
suggests, to imply that the text editor understands C++ at least well
enough to color it):
int n = wcslen(L"café");
(That's int n = wcslen(L"café"); for those without HTML email)
The L prefix on a string literal makes it a wide-character string, and
wcslen() is simply a wide-character version of strlen(). (There is no
guarantee that "wide character" means "Unicode character", but let's
just assume that it does, for the moment).
So, should n equal four or five? The answer would appear to depend on
whether or not the source file was saved in NFC or NFD format.
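To make the four-or-five question concrete without depending on how any editor saved the file, here is a minimal sketch that spells out both forms with universal character names. The names cafe_nfc, cafe_nfd, len_nfc and len_nfd are made up for illustration, and the sketch assumes a compiler where each wchar_t holds one of these code points (both are in the BMP, so that holds on the usual 16-bit and 32-bit wchar_t platforms).

```cpp
#include <cstddef>
#include <cwchar>

// NFC spelling: U+00E9 LATIN SMALL LETTER E WITH ACUTE is one code point.
const wchar_t cafe_nfc[] = L"caf\u00E9";

// NFD spelling: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
const wchar_t cafe_nfd[] = L"cafe\u0301";

// wcslen() counts wchar_t units up to the terminating null;
// it knows nothing about canonical equivalence.
std::size_t len_nfc() { return std::wcslen(cafe_nfc); }  // 4 units
std::size_t len_nfd() { return std::wcslen(cafe_nfd); }  // 5 units
```

The two literals are canonically equivalent and would normally render identically, yet the counts differ.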
No, surely not. If the wcslen() function is fully Unicode conformant, it
should give the same output whatever the canonically equivalent form of
its input. That more or less implies that it should normalise its input.
(One can imagine a second parameter specifying whether NFC or NFD is
required.) This makes the issue one not for the text editor but for the
programming language or its string handling library.
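As a rough illustration of a length function that does not care which form it is given, the sketch below simply skips code points in the Combining Diacritical Marks block (U+0300 to U+036F) while counting. This is NOT Unicode normalization, only a stand-in that happens to agree for simple cases like this one; a real library would apply full NFC or NFD (ICU is one library that provides this). Both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true only for the Combining Diacritical Marks
// block. Real combining-class data covers far more than this range.
bool is_combining_diacritic(wchar_t wc) {
    return wc >= 0x0300 && wc <= 0x036F;
}

// Hypothetical helper: count wchar_t units, ignoring combining
// diacritics, so a base letter plus its accent counts once.
std::size_t length_ignoring_combining_marks(const std::wstring& text) {
    std::size_t count = 0;
    for (wchar_t wc : text) {
        if (!is_combining_diacritic(wc)) {
            ++count;
        }
    }
    return count;
}
```

With this, L"caf\u00E9" and L"cafe\u0301" both come out as four.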
There is more to consider than just how and whether a text editor
normalizes. If a text editor is capable of dealing with Unicode text,
perhaps it should also be able to explicitly DISPLAY the actual
composition form of every glyph. The question I posed in the previous
paragraph should ideally be obvious by sight - if you see four
characters, there are four characters; if you see five characters,
there are five characters. This implies that such a text editor should
display NFD text as separate glyphs for each character.
On the other hand, such a text editor must also acknowledge that "é"
and "e + U+0301" are actually equivalent. The /intention/ of canonical
equivalence is that the glyphs should display the same - otherwise
we'd need precomposed versions of, well, everything. So in other
contexts, it should display them the same.
The Unicode standard does allow for special display modes in which the
exact underlying string, including control characters, is made visible.
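One simple form such a display mode could take is to list the code points next to the rendered text. The following is only a sketch of that idea, with a made-up function name show_code_points; it assumes one wchar_t per code point, so on platforms with a 16-bit wchar_t it would show surrogates separately.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each wchar_t unit as U+XXXX,
// separated by single spaces.
std::string show_code_points(const std::wstring& text) {
    std::string out;
    for (wchar_t wc : text) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "U+%04lX",
                      static_cast<unsigned long>(wc));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

Applied to the NFD spelling of "café", this makes the separate U+0065 and U+0301 visible even when the glyphs look the same as the NFC spelling.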
Yuk. That's a lot to think about for anyone considering writing a
programmers' text editor with /serious/ Unicode support.
Jill
Simply allow the text editor to save as either NFC or NFD, and let the
programming language sort out the rest.
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/