------- Additional Comments From zack at gcc dot gnu dot org 2005-01-07 07:10 -------

Joseph - I never properly answered your question in comment #7, although arguably the answer is already in comment #4.
I should mention that I take as a basic premise that, without exception, a sequence of UCNs and a sequence of extended-source-character-set characters which both encode the same sequence of ISO 10646 code points should be treated identically. Therefore, I'm going to talk exclusively about code points below.

The scenario that causes ABI breakage is as follows:

1) A shared library author gives an exported interface function a name containing, for instance, U+212B ANGSTROM SIGN.

2) This is compiled with a compiler that, hewing to the letter of the standard, does not perform any normalization. The shared library's symbol table therefore also contains U+212B. That code point is now part of the library ABI.

3) A program that uses this library is compiled with the same compiler; it expects a symbol containing U+212B.

4) Later, someone recompiles the library with a compiler that applies NFC to all identifiers. The library now exports a symbol containing U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE. The program compiled in step 3 breaks.

An obvious rebuttal is that the compiler used in step 4 is broken. As you say, the C standard references ISO 10646, not Unicode, and the concept of normalization does not exist in ISO 10646; this could be taken to imply that no normalization shall occur. However, there is no unambiguous statement to that effect in the standard, and there is strong quality-of-implementation pressure in the opposite direction. Put the standard aside for a moment: are users going to like a compiler that insists that "Å" (U+00C5) and "Å" (U+212B) are not the same character? [It happens that on my screen those render ever so slightly differently, but that's just luck - and X11 will only let me type U+00C5; I resorted to hex-editing to get the other.]

Furthermore, I can easily imagine someone writing a Unicode-aware text editor and thinking it a good idea to convert every file to NFC when saved.
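The U+212B/U+00C5 collision can be demonstrated with Python's standard unicodedata module; the identifier name below is hypothetical, chosen only to illustrate what such an NFC pass does to a symbol:

```python
import unicodedata

# Hypothetical identifier containing U+212B ANGSTROM SIGN; the name
# "get_\u212b" is illustrative only, not taken from the bug report.
original = "get_\u212b"

# NFC rewrites U+212B to U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE,
# so the normalized spelling is a different code point sequence.
normalized = unicodedata.normalize("NFC", original)

print(original == normalized)            # False
print(unicodedata.name(normalized[-1]))  # LATIN CAPITAL LETTER A WITH RING ABOVE

# The UTF-8 encodings differ too, so a linker comparing symbol names
# byte-for-byte would see two distinct symbols.
print(original.encode("utf-8"))    # b'get_\xe2\x84\xab'
print(normalized.encode("utf-8"))  # b'get_\xc3\x85'
```

Since symbol lookup is a byte-for-byte comparison, the pre- and post-normalization spellings are different symbols as far as the dynamic linker is concerned, which is exactly the break described in step 4.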
Making some unrelated change to the file that defines the U+212B symbol, then saving it with this editor, would trigger exactly the same ABI break that the hypothetical normalizing compiler would. This possibility means that a WG14/WG21 no-normalization mandate would NOT prevent silent ABI breakage. And the existence of this possibility increases the QoI pressure for a compiler to do normalization, as a defensive measure against such external changes. You could argue that this is just another way for C programmers to shoot themselves in the foot, but I don't think the myriad ways that already exist are a reason to add more.

For these reasons I see no safe way to implement extended identifiers except to persuade both WG14 and WG21 to mandate use of UAX#15 annex 7, instead of the existing lists of allowed characters. I'm willing to consider other normalization schemes and sets of allowed characters (as long as C and C++ are consistent with each other), but not plans that don't include normalization. To address the concern about requiring huge tables, perhaps the standards could say that it is implementation-defined whether extended characters are allowed at all.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9449