Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

Peter Kirk Tue, 09 Dec 2003 11:04:40 -0800

On 09/12/2003 07:00, Arcane Jill wrote:

Hmm. Now here's some C++ source code (syntax colored as Philippe suggests, to imply that the text editor understands C++ at least well enough to color it):

int n = wcslen(L"café");

(That's int n = wcslen(L"café"); for those without HTML email)

The L prefix on a string literal makes it a wide-character string, and wcslen() is simply a wide-character version of strlen(). (There is no guarantee that "wide character" means "Unicode character", but let's just assume that it does, for the moment).

So, should n equal four or five? The answer would appear to depend on whether or not the source file was saved in NFC or NFD format.
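To make the four-versus-five question concrete, here is a minimal sketch that spells both forms with universal character names, so the result no longer depends on how the source file was saved. It assumes, as above, that a wide character holds a Unicode code point; the names kCafeNfc, kCafeNfd, length_nfc and length_nfd are made up for illustration.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical names, for illustration only.
// NFC spelling: "é" is the single code point U+00E9.
const wchar_t kCafeNfc[] = L"caf\u00E9";
// NFD spelling: "e" followed by U+0301 COMBINING ACUTE ACCENT.
const wchar_t kCafeNfd[] = L"cafe\u0301";

// wcslen() counts wide characters up to the terminating null.
// It does not normalise anything.
std::size_t length_nfc() { return std::wcslen(kCafeNfc); }
std::size_t length_nfd() { return std::wcslen(kCafeNfd); }
```

Under that assumption length_nfc() gives 4 and length_nfd() gives 5, even though the two strings are canonically equivalent.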

No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input. (One can imagine a second parameter specifying whether NFC or NFD is required.) This makes the issue one not for the text editor but for the programming language or its string handling library.
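As a rough sketch of what "normalise its input" could mean for a length function, the toy below composes a base letter with a following U+0301 before counting. Everything in it is an assumption for illustration: the names compose_with_acute, toy_compose and toy_nfc_length are invented, the table covers only five vowels, and real NFC also needs canonical reordering and the full Unicode composition data, which a proper library would supply.

```cpp
#include <cstddef>
#include <string>

// U+0301 COMBINING ACUTE ACCENT.
constexpr char32_t kCombiningAcute = 0x0301;

// Toy table (hypothetical): precomposed form of base + U+0301,
// or 0 when the base is not covered by this sketch.
char32_t compose_with_acute(char32_t base) {
    switch (base) {
        case U'a': return 0x00E1;  // á
        case U'e': return 0x00E9;  // é
        case U'i': return 0x00ED;  // í
        case U'o': return 0x00F3;  // ó
        case U'u': return 0x00FA;  // ú
        default:   return 0;
    }
}

// Replaces each "base, U+0301" pair found in the toy table with its
// precomposed code point; all other input is copied unchanged.
std::u32string toy_compose(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == kCombiningAcute && !out.empty()) {
            char32_t composed = compose_with_acute(out.back());
            if (composed != 0) {
                out.back() = composed;
                continue;
            }
        }
        out.push_back(c);
    }
    return out;
}

// Length in code points after the toy composition step.
std::size_t toy_nfc_length(const std::u32string& s) {
    return toy_compose(s).size();
}
```

With this, both U"caf\u00E9" and U"cafe\u0301" report the same length, which is the behaviour argued for above, at the cost of the function no longer counting the stored characters.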

There is more to consider than just how and whether a text editor normalizes. If a text editor is capable of dealing with Unicode text, perhaps it should also be able to explicitly DISPLAY the actual composition form of every glyph. The question I posed in the previous paragraph should ideally be obvious by sight - if you see four characters, there are four characters; if you see five characters, there are five characters. This implies that such a text editor should display NFD text as separate glyphs for each character.

On the other hand, such a text editor must also acknowledge that "é" and "e + U+0301" are actually equivalent. The /intention/ of canonical equivalence is that the glyphs should display the same - otherwise we'd need precomposed versions of, well, everything. So in other contexts, it should display them the same.

The Unicode standard does allow for special display modes in which the exact underlying string, including control characters, is made visible.
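One simple form such a display mode might take is to list the code points behind the text. The sketch below is only an illustration of that idea, not anything the standard prescribes; the function name reveal_code_points is invented here.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders each code point as U+XXXX (at least four
// hex digits), separated by single spaces.
std::string reveal_code_points(const std::u32string& text) {
    std::string out;
    for (char32_t c : text) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "U+%04lX",
                      static_cast<unsigned long>(c));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For the NFC spelling of "café" this yields "U+0063 U+0061 U+0066 U+00E9", and for the NFD spelling "U+0063 U+0061 U+0066 U+0065 U+0301", so the difference is visible even when the ordinary rendering is identical.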

Yuk. That's a lot to think about for anyone considering writing a programmers' text editor with /serious/ Unicode support.

Jill

Simply allow the text editor to save as either NFC or NFD, and let the programming language sort out the rest.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
