On Wed, 4 Dec 2002, Keld Jørn Simonsen wrote:

> On Tue, Dec 03, 2002 at 10:33:19PM -0800, H. Peter Anvin wrote:
> > Maybe a --normalize-utf option to the linker might be a good idea, but
> > it should be an option, IMO.
>
> First of all, the standard does not refer to Unicode, but to 10646.
> And the C standard does not use Unicode normalization.
> There is a list in the ISO C standard of 10646 characters that are
> allowed in identifiers, and these do not have alternate representations.

Thank you for the note. I found the FCD of ISO/IEC 9899:1999 (N2794 at
http://wwwold.dkuug.dk/jtc1/sc22/open/n2794). It dates from Aug. 1998.
Annex I, 'Universal Character names for identifiers' (page 487; page 499
if you view the PDF version in Acroread), lists the set of characters
that are allowed. (A more or less identical list is found at
http://std.dkuug.dk/TC1/SC22/WG20/docs/standards#10176)

Basically, ISO C99 seems to avoid the problems arising from multiple
representations by allowing only precomposed characters in identifiers.
(Is there any change in this regard in the finally approved
ISO/IEC 9899:1999?) Keld's statement that they do not have alternate
representations is not right. If that were the case, characters like
'Latin Small Letter with Macron' or 'Hangul Syllable Gga', for which
there are alternate representations, should not be present in the list,
but they are listed as allowed.

What ISO C99 seems to do is to shift the burden of normalization from
compilers and linkers to editors, or whatever tools programmers use to
edit source files. That's fine (editors can do that) and is perhaps a
wise decision (issuing an error and stopping compilation at the earliest
stage prevents potential trouble from propagating through the
compiler-linker chain), but there's a little trouble with allowing only
precomposed characters.

Neither ISO/IEC JTC1/SC2/WG2 nor the UTC will encode any more
precomposed characters that can be represented with existing base
characters followed by one or more combining characters. However,
'combining diacritical marks' (e.g. \u0300 - \u0362) are not allowed in
identifiers, so any character that is not encoded in precomposed form
can't be used in identifiers. Some people would resent not being able to
use 'their characters' in identifiers and may use this to make a case
for encoding precomposed forms of their characters in ISO 10646.

How about references to filenames (as in the '#include' directive) with
combining diacritical marks that are parts of characters NOT encoded in
precomposed form? Aha, they can use '\unnnn' or '\Unnnnnnnn'...

Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
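
To make the alternate-representation point in the message above concrete, here is a minimal sketch in Python using the standard `unicodedata` module. The specific code points are illustrative assumptions, not taken from the message: U+0101 (a with macron) stands in for the macron letters, and U+AE4C is the code point named HANGUL SYLLABLE GGA. The helper `same_after_nfc` is hypothetical and only shows what a normalizing tool (an editor, or the proposed linker option) would have to do; it is not how any particular compiler or linker behaves.

```python
import unicodedata

# Illustrative choice: precomposed 'a with macron' vs. base letter + combining macron.
precomposed = "\u0101"      # LATIN SMALL LETTER A WITH MACRON
decomposed = "a\u0304"      # 'a' followed by COMBINING MACRON (inside \u0300 - \u0362)

# Hangul Syllable Gga: precomposed vs. conjoining jamo sequence.
gga_precomposed = "\uae4c"
gga_jamo = "\u1101\u1161"   # choseong ssangkiyeok + jungseong a


def same_after_nfc(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Compared as raw code point sequences, the two spellings are different names.
raw_equal = precomposed == decomposed

# After normalization they compare equal, which is the step C99 leaves to tools.
nfc_equal = same_after_nfc(precomposed, decomposed)
gga_equal = same_after_nfc(gga_precomposed, gga_jamo)

# A nonzero canonical combining class marks U+0304 as a combining mark.
macron_is_combining = unicodedata.combining("\u0304") != 0

print(raw_equal, nfc_equal, gga_equal, macron_is_combining)
```

Both characters have a precomposed form and an equivalent combining sequence, so a rule that admits only the precomposed form avoids ambiguity in identifiers only if something upstream has already normalized the source text.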