------- Additional Comments From joseph at codesourcery dot com  2004-12-16 02:54 -------
Subject: Re:  UCNs not recognized in identifiers (c++/c99)

On Thu, 16 Dec 2004, zack at codesourcery dot com wrote:

> Joseph Myers insists that this situation cannot arise, because
> C99/C++'s lists of valid Unicode code points in identifiers exclude
> all combining forms.  But if I enforce those rules users will hate the

(More precisely: they exclude combining forms for languages where the 
precomposed forms are made available, which reduces the uniqueness issues.  
This is not a theorem about what might be done with general combinations 
of the characters listed as valid, given that, for example, the definition 
of NFC has itself since been found to be defective 
<http://www.unicode.org/review/pr-29.html>, albeit only for examples that 
cannot occur in real languages.)

And also that:

* The combining rules are not part of what C99 or C++ normatively 
reference.

* Identical-looking characters can occur even without combining 
characters: for example U+0041 LATIN CAPITAL LETTER A, U+0391 GREEK 
CAPITAL LETTER ALPHA and U+0410 CYRILLIC CAPITAL LETTER A.  To let users 
tell these apart, I think compiler diagnostics (and probably linker 
diagnostics too) should either default to showing \u or \U sequences 
rather than raw identifiers, or at least have an option to do so.
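
To make the diagnostics suggestion concrete, here is a minimal sketch of 
rendering an identifier with \u / \U sequences.  It assumes identifiers 
are held internally as UTF-8, which is an assumption and not a statement 
about GCC internals.  The names decode_utf8 and escape_identifier are 
hypothetical, and the decoder is deliberately simplistic (it does not 
reject overlong encodings or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence at s.  Stores the code point in *cp and
   returns the number of bytes consumed, or 0 if the sequence is
   malformed.  Simplified: overlong forms are not rejected.  */
static size_t decode_utf8(const unsigned char *s, unsigned long *cp)
{
    size_t len, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;

    for (i = 1; i < len; i++) {
        /* A NUL terminator also fails this test, so we never read
           past the end of a truncated string.  */
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical helper: copy the UTF-8 identifier id into out, writing
   every non-ASCII character as \uXXXX or \UXXXXXXXX.  Returns 0 on
   success, -1 on malformed input or if out is too small.  */
int escape_identifier(const char *id, char *out, size_t outsz)
{
    const unsigned char *s = (const unsigned char *)id;
    size_t used = 0;

    assert(id != NULL && out != NULL);
    if (outsz == 0) return -1;

    while (*s) {
        char tmp[11];              /* "\UXXXXXXXX" plus NUL */
        unsigned long cp;
        size_t len, n = decode_utf8(s, &cp);

        if (n == 0) return -1;
        if (cp < 0x80) { tmp[0] = (char)cp; tmp[1] = '\0'; }
        else if (cp <= 0xFFFF) sprintf(tmp, "\\u%04lX", cp);
        else sprintf(tmp, "\\U%08lX", cp);

        len = strlen(tmp);
        if (used + len + 1 > outsz) return -1;
        memcpy(out + used, tmp, len);
        used += len;
        s += n;
    }
    out[used] = '\0';
    return 0;
}
```

With this, an identifier consisting of GREEK CAPITAL LETTER ALPHA would 
be shown as \u0391 and one consisting of CYRILLIC CAPITAL LETTER A as 
\u0410, so neither can be mistaken for the ASCII letter A.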

(Previous threads on gcc-patches and gcc, Oct-Nov 2002.)

> compiler, because their text editors will generate what looks like
> perfectly fine text and then the compiler will barf on it.  And I am
> not prepared to trust that every editor on the planet will adhere to
> C99/C++'s rules.  And even if I were, we'd still have the problem of
> the C99 and C++ lists not being identical.

I do not expect such user complaints simply because I don't expect users 
to be widely trying to use extended characters (with or without UCNs) in 
identifiers within the next several years.  (Extended characters in 
strings and comments are another matter, but don't cause such problems.)  
I'd say implement the rules if someone wishes to do so - complete with the 
previous and following oddities - then try to get things cleaned up for 
the next major revisions of C and C++.

Oddities:

1. Lexing UCNs in identifiers can require up to nine characters of 
backtracking:

a\U000000Cz

is three preprocessing tokens {a}{\}{U000000Cz}, because \U needs eight 
hexadecimal digits and only seven precede the z.
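
A rough sketch of how an identifier lexer might handle this case, under 
some simplifying assumptions: it uses lookahead on a NUL-terminated 
buffer instead of literally backing up, it relies on the "C" locale for 
isalpha/isdigit, and it does not check the code point against the C99 or 
C++ lists of characters valid in identifiers.  The names ucn_length and 
lex_identifier are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* If s begins with a complete UCN (\u plus 4 hex digits, or \U plus
   8 hex digits), return its length (6 or 10); otherwise return 0.
   Nothing is consumed on failure, which is what saves the caller
   from backing up over as many as nine characters.  */
static size_t ucn_length(const char *s)
{
    size_t digits, i;

    if (s[0] != '\\') return 0;
    if (s[1] == 'u') digits = 4;
    else if (s[1] == 'U') digits = 8;
    else return 0;

    for (i = 0; i < digits; i++)
        if (!isxdigit((unsigned char)s[2 + i]))
            return 0;          /* also stops at the NUL terminator */
    return 2 + digits;
}

/* Return the length of the identifier starting at s, or 0 if s does
   not start an identifier.  */
size_t lex_identifier(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (;;) {
        unsigned char c = (unsigned char)s[n];
        size_t u;

        if (c == '_' || isalpha(c) || (n > 0 && isdigit(c)))
            n++;
        else if ((u = ucn_length(s + n)) != 0)
            n += u;
        else
            break;
    }
    return n;
}
```

On a\U000000Cz this stops after the a, leaving the backslash to be lexed 
as a token of its own and U000000Cz as an ordinary identifier.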

2. (A separate general UCN issue, nothing to do with their use in 
identifiers so in no way required for implementing them in identifiers.)

C++, but not C, converts all extended characters in the source file to 
UCNs in phase 1, so stringising "$" generates different results in C and 
C++ <http://gcc.gnu.org/ml/gcc-patches/2003-04/msg01523.html>.  (Doing 
this efficiently does mean only making this UTF-8 -> UCN conversion if the 
file contains extended characters, ideally only if it contains them 
outside comments.)
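
The cheap pre-check could look something like the sketch below, which 
assumes the source is UTF-8 so that any extended character shows up as a 
byte of 0x80 or above.  The name has_extended_chars is hypothetical.  
This version does not skip comments, and it says nothing about ASCII 
characters that lie outside the basic source character set.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if buf[0..len) contains any byte >= 0x80, i.e. any part
   of a multibyte UTF-8 sequence, else 0.  Only if this returns 1 is
   the UTF-8 -> UCN pass worth running at all.  */
int has_extended_chars(const char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        if ((unsigned char)buf[i] >= 0x80)
            return 1;
    return 0;
}
```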



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9449
