[Bug c/67224] UTF-8 support for identifier names in GCC

joseph at codesourcery dot com Thu, 20 Aug 2015 15:42:06 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224


--- Comment #21 from joseph at codesourcery dot com <joseph at codesourcery dot 
com> ---
_cpp_interpret_identifier converts UCNs to UTF-8 which is the canonical 
internal form for identifiers - for UTF-8 in identifiers, you just need to 
pass in straight through unmodified there.  (cpplib takes care to store 
the original spelling of the identifier as well for purposes for which 
that matters, but that's simply a matter of lex_identifier calling 
cpp_lookup on the original spelling as well as using 
_cpp_interpret_identifier to get the canonical version.)

So you never need to convert UTF-8 to UCNs in order to handle UTF-8 in 
identifiers (cpplib has logic to do so when needed for output, but you 
don't need to add anything new in that regard).  You do need to decode 
UTF-8 into character values for the code that checks normalization, which 
characters are allowed at the start of identifiers, etc., just as the 
existing code decodes UCNs into such values.  (But as I noted, a UCN not 
allowed in identifiers is lexed as part of an identifier, which is then 
considered invalid, whereas a UTF-8 character not allowed in identifiers 
should be lexed as a separate pp-token.  However, UTF-8 for a character 
allowed in identifiers but not at the start of an identifier should, I 
think, be lexed as an identifier character even at the start of an 
identifier, and then give an error for an invalid identifier if it appears 
at the start of an identifier.  That's my reading of the syntax 
productions in the C standard.)

You can ignore anything claiming to handle UTF-EBCDIC.

[Bug c/67224] UTF-8 support for identifier names in GCC

Reply via email to