https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224
--- Comment #21 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---

_cpp_interpret_identifier converts UCNs to UTF-8, which is the canonical internal form for identifiers - for UTF-8 in identifiers, you just need to pass it straight through unmodified there. (cpplib takes care to store the original spelling of the identifier as well, for purposes for which that matters, but that's simply a matter of lex_identifier calling cpp_lookup on the original spelling as well as using _cpp_interpret_identifier to get the canonical version.)

So you never need to convert UTF-8 to UCNs in order to handle UTF-8 in identifiers (cpplib has logic to do so when needed for output, but you don't need to add anything new in that regard).

You do need to decode UTF-8 into character values for the code that checks normalization, which characters are allowed at the start of identifiers, etc., just as the existing code decodes UCNs into such values. (But as I noted, a UCN not allowed in identifiers is lexed as part of an identifier, which is then considered invalid, whereas a UTF-8 character not allowed in identifiers should be lexed as a separate pp-token. However, UTF-8 for a character allowed in identifiers but not at the start of an identifier should, I think, be lexed as an identifier character even at the start of an identifier, and then give an error for an invalid identifier if it appears at the start of an identifier. That's my reading of the syntax productions in the C standard.)

You can ignore anything claiming to handle UTF-EBCDIC.
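
To make the lexing rule above concrete, here is a minimal, standalone sketch in C. It is not cpplib code: decode_utf8, classify_start, the toy_* predicates and the lex_result enum are all hypothetical names invented for illustration, and the toy_* predicates cover only a few sample ranges rather than the real tables of characters allowed in identifiers. Treating malformed UTF-8 the same as a disallowed character is also an assumption of the sketch, not something stated in the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s[0..len).  On success, store the
   character value in *cp and return the number of bytes consumed (1..4).
   Return 0 for truncated, overlong, surrogate, or out-of-range input.  */
static size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    size_t n;
    uint32_t c, min;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    if (len < n)
        return 0;   /* sequence is truncated */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;   /* overlong, out of range, or surrogate */
    *cp = c;
    return n;
}

/* Toy stand-ins for the real identifier tables: sample ranges only.  */
static int toy_allowed_in_identifier(uint32_t c)
{
    return (c >= 0x00C0 && c <= 0x00D6)    /* some Latin-1 letters */
        || (c >= 0x00D8 && c <= 0x00F6)    /* more Latin-1 letters */
        || (c >= 0x0300 && c <= 0x036F);   /* combining marks */
}

static int toy_allowed_at_start(uint32_t c)
{
    /* Combining marks may appear in an identifier, but not first.  */
    return toy_allowed_in_identifier(c) && !(c >= 0x0300 && c <= 0x036F);
}

enum lex_result {
    LEX_SEPARATE_TOKEN,   /* not an identifier character: its own pp-token */
    LEX_IDENT_OK,         /* starts a valid identifier */
    LEX_IDENT_BAD_START   /* lexed as an identifier, then diagnosed */
};

/* Classify a non-ASCII UTF-8 sequence found where an identifier could
   start.  ASCII is assumed to be handled by the ordinary lexer paths.
   Malformed UTF-8 is treated as a separate token (sketch assumption).  */
static enum lex_result classify_start(const unsigned char *s, size_t len)
{
    uint32_t c;
    if (decode_utf8(s, len, &c) == 0 || !toy_allowed_in_identifier(c))
        return LEX_SEPARATE_TOKEN;
    return toy_allowed_at_start(c) ? LEX_IDENT_OK : LEX_IDENT_BAD_START;
}
```

With these toy tables, the bytes C3 A9 (U+00E9) classify as LEX_IDENT_OK, CC 81 (U+0301, a combining mark) as LEX_IDENT_BAD_START, and E2 86 92 (U+2192) as LEX_SEPARATE_TOKEN. Note that the decoded value is used only for the checks; the UTF-8 bytes themselves would be stored unmodified, matching the point that no conversion to UCNs is needed.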