------- Additional Comments From zack at gcc dot gnu dot org  2005-01-07 07:10 -------
Joseph - I never properly answered your question in comment #7, although
arguably the answer is already in comment #4.

I should mention that I take as a basic premise that, without exception, a
sequence of UCNs and a sequence of extended-source-character-set characters
encoding the same sequence of ISO10646 code points should be treated
identically.  Therefore, I'm going to talk exclusively about code points below.
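
To make that premise concrete, here is a minimal, hypothetical sketch (it
assumes a compiler that accepts both UCNs and raw UTF-8 extended characters
in C99 identifiers, and a source file saved as UTF-8).  Both spellings below
encode the single code point U+00C5, so on this premise they name the same
identifier:

int \u00C5_count = 0;          /* spelled with a universal character name    */

int bump(void)
{
    return ++Å_count;          /* spelled with the raw extended character    */
}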

The scenario that causes ABI breakage is as follows (a minimal code sketch
appears after the steps):

1) A shared library author gives an exported interface function a name
containing, for instance, U+212B ANGSTROM SIGN.

2) This is compiled with a compiler that, hewing to the letter of the standard, 
does not perform any normalization.  The shared library's symbol table therefore
also contains U+212B.  That code point is now part of the library ABI.

3) A program that uses this library is compiled with the same compiler; it
expects a symbol containing U+212B.

4) Later, someone recompiles the library with a compiler that applies NFC to all
identifiers.  The library now exports a symbol containing U+00C5 LATIN CAPITAL
LETTER A WITH RING ABOVE.  The program compiled in step 3 breaks.
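
Here is a minimal sketch of those four steps (the file and function names are
hypothetical, and a compiler that implements C99 extended identifiers and
accepts U+212B in identifiers is assumed):

/* library.c -- compiled in step 2 by the non-normalizing compiler; the
   exported symbol therefore contains the encoding of U+212B.           */
int \u212B_to_nm(int angstroms)
{
    return angstroms / 10;
}

/* client.c -- compiled in step 3; it references a symbol containing
   U+212B.  If the library is rebuilt in step 4 with a compiler that
   applies NFC, the library instead exports a symbol containing U+00C5
   and this program no longer links or loads.                           */
extern int \u212B_to_nm(int angstroms);

int main(void)
{
    return \u212B_to_nm(25);
}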

An obvious rebuttal to this is that the compiler used in step 4 is broken.  As
you say, the C standard references ISO10646, not Unicode, and the concept of
normalization does not exist in ISO10646; this could be taken to imply that no
normalization shall occur.  However, there is no unambiguous statement to
that effect in the standard, and there is strong quality-of-implementation
pressure in the opposite direction.  Put aside the standard for a moment: are
users going to like a compiler that insists that "Å" (U+00C5) and "Å" (U+212B)
are not the same character?  [It happens that on my screen those are ever so
slightly different, but that's just luck - and X11 will only let me type
U+00C5; I resorted to hex-editing to get the other.]
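
For what it's worth, the two code points are distinct at the byte level as
well; it is NFC that identifies them, since U+212B normalizes to U+00C5.  A
small sketch (assuming a UTF-8 execution character set) makes the difference
visible:

#include <stdio.h>

int main(void)
{
    const unsigned char a_ring[]   = "\u00C5";  /* LATIN CAPITAL LETTER A WITH RING ABOVE */
    const unsigned char angstrom[] = "\u212B";  /* ANGSTROM SIGN */
    const unsigned char *p;

    /* With a UTF-8 execution character set this prints the byte
       sequences c3 85 and e2 84 ab -- different bytes, hence
       different symbol names if they reach the symbol table.     */
    for (p = a_ring; *p; p++)
        printf("%02x ", (unsigned) *p);
    putchar('\n');
    for (p = angstrom; *p; p++)
        printf("%02x ", (unsigned) *p);
    putchar('\n');
    return 0;
}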

Furthermore, I can easily imagine someone writing a Unicode-aware text editor
and thinking it's a good idea to convert every file to NFC when saved.  Making
some unrelated change to the file defining the symbol with U+212B in it, with
this editor, would trigger the exact same ABI break that the hypothetical
normalizing compiler would.  This possibility means that a WG14/WG21
no-normalization mandate would NOT prevent silent ABI breakage.  And the
existence of this possibility increases the QoI pressure for a compiler to do
normalization, as a defensive measure against such external changes.  You could
argue that this is just another way for C programmers to shoot themselves in the
foot, but I don't think the myriad ways that already exist are a reason to add 
more.

For these reasons I see no safe way to implement extended identifiers except to
persuade both WG14 and WG21 to mandate use of UAX #15 Annex 7, instead of the
existing lists of allowed characters.  I'm willing to consider other
normalization schemes and sets of allowed characters (as long as C and C++ are
consistent with each other), but not plans that don't include normalization.
To address the concern about requiring huge tables, perhaps the standards
could say that it is implementation-defined whether extended characters are
allowed at all.

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9449
