On 10/12/2003 02:41, [EMAIL PROTECTED] wrote:

> Quoting Peter Kirk <[EMAIL PROTECTED]>:



>> OK, as a C function handling wchar_t arrays it is not expected to conform to Unicode. But if it is presented as a function available to users for handling Unicode text, for determining how many characters (as defined by Unicode) are in a string, it should conform to Unicode, including C9.



> If a function is presented as a function available to users for handling Unicode text then it should do whatever it claims to do.


That's not what the standard says. According to C7:

C7 A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded character representation.
• This restriction does not preclude internal transformations that are never visible external to the process.

So, "If a function is presented as a function available to users for handling Unicode text", it has to do so in accordance with the standard, and is not free to do something else even if it openly claims to do that something else. (I understand "users" here as separate processes; Unicode conformance does not restrict internal functions.) And there is a clear intention that processes ought to treat all canonically equivalent strings identically, although there is a get-out clause allowing non-ideal implementations not to do so.


A process is permitted to offer a function which distinguishes between canonically equivalent forms, but, by C9, no other process is permitted to rely on this distinction. This seems paradoxical but is actually rather sensible. Such a distinction should only be made as an accidental feature of a non-ideal version of a function, perhaps one which makes no claim to support the whole of Unicode, and ideally such a function should be replaced over time by an upgraded version which supports the whole of Unicode and makes no distinction between canonically equivalent forms.

> There are concepts of "code units", "code points", "characters", and "default grapheme clusters" in Unicode. Functions which count any of these are perfectly conformant with Unicode, as long as they perform their task correctly.



I fully agree with you on "default grapheme clusters", a concept which is invariant under canonically equivalent transformations (that is right, isn't it?). These need to be counted by renderers, and perhaps in other circumstances; for example, this is probably the right thing to count when a character count is wanted as an estimate of the length of a text.

As for counting "code units", "code points" and "characters", we need to distinguish different levels here. Of course it is necessary to count such things internally within an implementation of certain Unicode functions, e.g. normalisation, and when allocating memory space. At this level we are talking about a data type consisting of bytes or words in one of the UTFs; we are not really talking about Unicode strings. Obviously the wcslen function as originally discussed is supposed to work at this level, and there is no problem with that. The problem comes when the same function is reused as a count of the length of a Unicode string. For one thing, it is going to give the wrong answer unless it uses 32-bit (well, 21-bit or more) words, as it certainly shouldn't be hacked to recognise surrogates. But the other problem is that to use this function with Unicode strings is to confuse different data types.
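
To make the difference concrete, here is a small illustration of my own (not from the earlier discussion): it counts 16-bit code units and code points in the same UTF-16 string, and the two counts diverge as soon as a supplementary character, encoded as a surrogate pair, appears. (Default grapheme clusters would give yet another count, but determining them needs the UAX #29 rules and data, so I have left them out.)

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Count 16-bit code units up to the terminating 0 -- what a
       wcslen-like function on a 16-bit wchar_t effectively does. */
    static size_t count_code_units(const uint16_t *s) {
        size_t n = 0;
        while (s[n] != 0)
            n++;
        return n;
    }

    /* Count code points, treating a high surrogate followed by a low
       surrogate as a single code point. */
    static size_t count_code_points(const uint16_t *s) {
        size_t n = 0;
        for (size_t i = 0; s[i] != 0; i++) {
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                i++;                /* skip the low surrogate of the pair */
            n++;
        }
        return n;
    }

    int main(void) {
        /* "Ae" followed by U+1D11E MUSICAL SYMBOL G CLEF (a surrogate pair) */
        const uint16_t s[] = { 0x0041, 0x0065, 0xD834, 0xDD1E, 0x0000 };
        printf("code units:  %zu\n", count_code_units(s));   /* prints 4 */
        printf("code points: %zu\n", count_code_points(s));  /* prints 3 */
        return 0;
    }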

I was implicitly thinking in terms of a higher-level and more abstract data type for a Unicode string. That is the level of abstraction which should be offered to users, i.e. other processes or application programmers, by, for example, a general purpose Unicode-compatible string handling and I/O library. Such a Unicode string data type should be independent of encoding form; the choice between UTF-8/16/32 etc. should be left to the compiler. C9 implies that it should also "ideally" be independent of which canonically equivalent form the text is in, and this ideal can easily (though maybe not efficiently) be attained by automatically normalising all strings passed to and from the library. (Indeed one might even build into the data type definition an automatic normalisation process, used whenever a string is stored, but I will assume that this is not done.) Within such a context, a library function to determine whether a string is normalised is meaningless, and will always return TRUE; and this is completely conformant to C9.
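
As a sketch of what such a boundary might look like (the type and function names here are hypothetical, and the normaliser is only a stub; a real library would apply NFC or NFD using the Unicode data, or an existing normalisation library such as ICU):

    #include <stdlib.h>
    #include <string.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical opaque Unicode string type; the encoding form used
       internally (UTF-8 here) is an implementation detail. */
    typedef struct {
        uint8_t *bytes;    /* always held in normalised form */
        size_t   length;   /* length in bytes of the internal form */
    } UniString;

    /* Stub for illustration only: a real implementation would produce
       NFC using the Unicode data or an existing normalisation library. */
    static uint8_t *normalise_nfc(const uint8_t *utf8, size_t len,
                                  size_t *out_len) {
        uint8_t *copy = malloc(len);
        if (copy != NULL)
            memcpy(copy, utf8, len);
        *out_len = len;
        return copy;
    }

    /* Every string entering the library is normalised at the boundary... */
    UniString *unistring_new(const uint8_t *utf8, size_t len) {
        UniString *s = malloc(sizeof *s);
        if (s == NULL)
            return NULL;
        s->length = 0;
        s->bytes = normalise_nfc(utf8, len, &s->length);
        return s;
    }

    /* ...so inside the abstraction this test is trivially true, which is
       exactly the situation C9 allows for. */
    bool unistring_is_normalised(const UniString *s) {
        (void)s;
        return true;
    }

The only point this is meant to show is where the normalisation happens: at the boundary of the data type, so that no caller ever sees two canonically equivalent strings as distinct.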

Within the functions associated with the data type, rather than as an external process or library function, there might be a place for a normalisation test function. On the other hand, at this level it is redundant, as the preferred thing to do with a non-normalised string is always to normalise it (or are there security-related cases where this does not apply?); and so if a string is required to be normalised, even if there is a good chance that it already is normalised, the correct thing to do is to normalise it again (and the normalisation function, operating at a lower level, may for efficiency first check normalisation before applying the full procedure).
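
That last point can be shown in a few lines (again only a sketch, with the quick check and the full normaliser stubbed out): the caller never asks whether the string is normalised, it just asks for a normalised copy, and any short-circuiting for efficiency happens inside.

    #include <stdlib.h>
    #include <string.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Stub: a real quick check walks the string using the
       NFC_Quick_Check property and returns true only when the string
       is definitely already in NFC. */
    static bool quick_check_nfc(const uint8_t *utf8, size_t len) {
        (void)utf8; (void)len;
        return false;
    }

    /* Stub: a real implementation applies the full NFC algorithm. */
    static uint8_t *full_nfc(const uint8_t *utf8, size_t len, size_t *out_len) {
        uint8_t *copy = malloc(len);
        if (copy != NULL)
            memcpy(copy, utf8, len);
        *out_len = len;
        return copy;
    }

    /* Callers always use this; the efficiency question is hidden here. */
    uint8_t *ensure_nfc(const uint8_t *utf8, size_t len, size_t *out_len) {
        if (quick_check_nfc(utf8, len)) {        /* certainly NFC already */
            uint8_t *copy = malloc(len);
            if (copy != NULL)
                memcpy(copy, utf8, len);
            *out_len = len;
            return copy;
        }
        return full_nfc(utf8, len, out_len);     /* otherwise do the full work */
    }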

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




