On Wed, 4 Dec 2002, Keld Jørn Simonsen wrote:

> On Tue, Dec 03, 2002 at 10:33:19PM -0800, H. Peter Anvin wrote:

> > Maybe a --normalize-utf option to the linker might be a good idea, but
> > it should be an option, IMO.
>
> First of all, the standard does not refer to Unicode, but to 10646.
> And the C standard does not use Unicode normalization.
> There is a list in the ISO C standard of 10646 characters that are
> allowed in identifiers, and these do not have alternate representations.

  Thank you for the note.

  I found FCD of ISO/IEC 9899 1999 (N2794 at
http://wwwold.dkuug.dk/jtc1/sc22/open/n2794). It dates from Aug.,
1998.  Annex I, 'Universal character names for identifiers' (page 487,
or page 499 if you view the PDF version in Acroread), lists the
characters that are allowed. (A more or less identical list is found at
http://std.dkuug.dk/TC1/SC22/WG20/docs/standards#10176) Basically ISO C99
seems to avoid problems arising from multiple representation issues by
allowing only precomposed characters in identifiers. (Is there any change
in this regard in the finally approved ISO/IEC 9899:1999?) However, Keld's
statement that they do not have alternate representations is not right.
If that were the case, characters like 'Latin Small Letter with Macron'
or 'Hangul Syllable Gga', for which there are alternate representations,
would not be in the list, but they are listed as allowed.
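
  The point about alternate representations can be checked with a
minimal Python sketch using the standard unicodedata module. The two
sample characters (U+0101 for a letter with macron, U+AE4C for the
Hangul syllable) and the helper name decomposed_form are assumptions
made for illustration only; they are not taken from the C standard.

```python
import unicodedata


def decomposed_form(ch):
    """Return the code points of the canonical decomposition (NFD) of ch."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


# Assumed samples: U+0101 (a with macron) and U+AE4C (Hangul syllable GGA).
for ch in ("\u0101", "\uAE4C"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", decomposed_form(ch))
    decomposed = unicodedata.normalize("NFD", ch)
    # The precomposed character and its decomposed sequence are canonically
    # equivalent, yet they differ when compared code point by code point.
    assert decomposed != ch
    assert unicodedata.normalize("NFC", decomposed) == ch
```

Both characters decompose into a base character followed by a second
code point, so each one has at least two canonically equivalent
spellings.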

  What ISO C99 seems to do is to shift the burden of normalization from
compilers and linkers to editors, or whatever tools programmers use to
edit source files.  That's fine (editors can do that) and is perhaps
a wise decision (it keeps potential trouble from propagating through
the compiler-linker chain by issuing an error and stopping compilation
at the earliest stage), but there's a little trouble with allowing only
precomposed characters. Neither ISO/IEC JTC1/SC2/WG2 nor the UTC will
encode any more precomposed characters that can be represented with
existing base characters followed by one or more combining characters.
However, 'combining diacritical marks' (e.g. \u0300 - \u0362) are not
allowed in identifiers, so any character that is not encoded in
precomposed form can't be used in identifiers. Some people would resent
not being able to use 'their characters' in identifiers and may use that
to make a case for encoding their precomposed forms in ISO 10646.  What
about references to filenames (as in an '#include' directive) containing
combining diacritical marks that are parts of characters NOT encoded in
precomposed form? Aha, they can use '\unnnn' or '\Unnnnnnnn'...
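
  As a rough illustration of the precomposed-only restriction, the
following Python sketch tests whether a base letter plus a combining
mark composes to a single code point under NFC. The helper name
composes_to_single and the sample pairs are hypothetical. NFC also skips
a few precomposed characters that are on the composition exclusion list,
so this is only an approximation of "has a precomposed form".

```python
import unicodedata


def composes_to_single(base, mark):
    """True if base + combining mark normalizes (NFC) to one code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1


# 'a' + U+0304 (combining macron) composes to the precomposed U+0101.
print(composes_to_single("a", "\u0304"))

# 'q' + U+0303 (combining tilde) is an assumed example of a sequence with
# no precomposed form: the combining mark survives normalization.
print(composes_to_single("q", "\u0303"))

# Python's \uXXXX and \UXXXXXXXX escapes name the same code point, much as
# the four-digit and eight-digit universal character names do in C99.
assert "q\u0303" == "q\U00000303"
```

A character of the second kind is exactly what cannot be written in an
identifier when combining marks are excluded.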

  Jungshik



--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
