https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #28 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
On Mon, 22 Jul 2019, lhyatt at gmail dot com wrote:

> I am interested in helping out with this if there is still interest to support
> this feature. (Full disclosure, I don't have any experience with the gcc
> codebase, but I do have a lot of experience developing with gcc.)

Thanks for working on this!  I encourage sending this to gcc-patches once 
a few fixes have been made and you've done the legal paperwork; see 
<https://gcc.gnu.org/contribute.html>.

I'm wary of the MIN use in _cpp_lex_direct: this is performance-critical 
code, so it's not clear an extra operation should be added for every 
token.  I'd rather put the check for UTF-8 in the default case (a case 
that should in practice be rare), with a goto from there to the case of 
identifiers.

As a coding style matter, note that in various places sentences in 
comments should start with a capital letter.

> 5. There is a problem with diagnostics output when the source contains UTF-8
> characters. The locator caret ends up in the wrong place, I assume just 
> because this code is not aware of the multibyte encoding. That much is 
> not specific to my patch; it exists already now, e.g. with:

This seems like it should have a separate bug filed for it (I don't see 
any currently open bugs for this issue).

> The bigger problem though is in layout::print_source_line(), which 
> colorizes the source lines. It seems to end up attempting to colorize 
> just the first byte, even for UTF-8, which makes the output no longer 
> valid. I tried to look into it but I wasn't sure what the implications 
> are, e.g. would it require some much larger overhaul of the diagnostics 
> infrastructure anyway to get this right, and would it perhaps be better 
> just to disable colorization in the presence of UTF-8 input or something 
> like this, for the meantime.

And this should probably also have a separate bug filed (whether or not it 
can occur without this patch applied).
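
For reference, the breakage is easy to see outside GCC: if the colour
escapes wrap only the first byte, the continuation byte of the multibyte
character is split off and the coloured span is no longer valid UTF-8.
A standalone sketch (utf8_len here is a hypothetical helper, not GCC
code) of spanning the whole character instead:

#include <stdio.h>

/* Number of bytes in the UTF-8 sequence whose lead byte is C.  */
static int
utf8_len (unsigned char c)
{
  if (c < 0x80) return 1;
  if ((c & 0xe0) == 0xc0) return 2;
  if ((c & 0xf0) == 0xe0) return 3;
  if ((c & 0xf8) == 0xf0) return 4;
  return 1;  /* invalid lead byte; treat as a single byte */
}

int
main (void)
{
  const char *line = "int \xcf\x80;";  /* "int <pi>;" in UTF-8 */
  int col = 4;                         /* byte offset of the character */

  /* Colorizing a single byte would leave the continuation byte 0x80
     outside the escapes; spanning the full sequence keeps the output
     valid.  */
  int len = utf8_len ((unsigned char) line[col]);
  printf ("%.*s\033[31m%.*s\033[0m%s\n",
          col, line, len, line + col, line + col + len);
  return 0;
}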

> This is also not specific to this patch and occurs the same way if a UCN
> is used:

This also seems like a matter for filing a separate bug.  Or maybe two 
separate bugs, one for C and one for C++, since the fixes might be 
different.  For C, the suggestion of \xcf\x80 looks like a missing call to 
identifier_to_locale when printing an identifier using %qs - but the C++ 
code is using %qE, which should use identifier_to_locale automatically, so 
I'm not sure what's wrong there.
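
To illustrate the distinction (a standalone toy model of the UCN
fallback, not the actual implementation in the GCC sources): a %qs
caller has to run the name through identifier_to_locale itself, while
%qE on an IDENTIFIER_NODE is supposed to apply that conversion
internally.  Conceptually the conversion does something like:

#include <stdio.h>

/* Toy model: print each non-ASCII character of a UTF-8 identifier as
   a \UNNNNNNNN UCN, which is what a diagnostic should show instead of
   raw bytes such as \xcf\x80 in a non-UTF-8 locale.  */
static void
print_identifier_as_ucns (const unsigned char *s)
{
  while (*s)
    {
      unsigned char c = *s;
      if (c < 0x80)
        {
          putchar (c);
          s++;
          continue;
        }
      /* Decode one UTF-8 sequence (input assumed valid here).  */
      int len = (c & 0xe0) == 0xc0 ? 2 : (c & 0xf0) == 0xe0 ? 3 : 4;
      unsigned int cp = c & (0xff >> (len + 1));
      for (int i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3f);
      printf ("\\U%08x", cp);
      s += len;
    }
}

int
main (void)
{
  print_identifier_as_ucns ((const unsigned char *) "\xcf\x80");
  putchar ('\n');  /* prints \U000003c0 */
  return 0;
}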

> 7. What is the expected output from gcc -E of this code?
> 
> -------
> int π;
> -------
> 
> Currently it outputs:
> int \U000003c0;
> 
> So curiously, it's as if the C++-required translation of extended chars 
> to UCNs is happening, so I think this output is actually potentially 
> correct in C++ mode? But it is also this way in C mode, which I think is 
> probably not expected. It seems to come from cpp_output_token(), which 
> does not make use of the "original spelling" data structures. I am not 
> sure about this one, but probably the right solution is not much work, 
> if someone knows what that might be?

I don't think the -E output matters much here; it's not specified by the 
standard.

The results of stringizing *are* more precisely defined (the relevant 
tests stringize twice to verify those results).  Strictly, for C++ 
stringizing twice (for extended characters including $ @ `) should make 
the conversion of such characters to UCNs visible (in strings, not just in 
identifiers), because, unlike C, C++ does not have the special rule making 
it implementation-defined whether the \ of a UCN in a string literal is 
doubled when stringizing.  I don't think that's something you need to fix, 
however, since there's no attempt to implement that conversion for C++ at 
present, but it does make a couple of the C++ tests in your patch strictly 
invalid.
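
A sketch of the double-stringizing pattern in question (illustrative
macros, not copied from the testsuite):

#define S(x) #x
#define SS(x) S(#x)

/* The inner # turns the identifier into a string literal; the outer #
   then stringizes that literal, which is where the difference shows.  */
const char *once = S(\u03c0);   /* "\u03c0" */
const char *twice = SS(\u03c0);
/* In C++ the \ must be doubled when stringizing the string literal,
   giving "\"\\u03c0\""; in C it is implementation-defined whether the
   \ beginning the UCN is doubled.  */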

> This is also the reason that one of the new testcases
> (gcc/testsuite/gcc.dg/cpp/ucnid-13-utf8.c) fails, this:
> 
> #define Á 1
> 
> also preprocesses (in -E -dD) to include UCNs. I am not sure what is expected
> here.

There is definitely no need to preserve spelling there (it's not even 
possible in general, since the same macro name can be spelt differently in 
otherwise identical definitions of the same macro; it's only a constraint 
violation if either macro argument names or the RHS are different, not if 
the name of the macro itself is spelt differently).  So the right thing is 
to test that the output in that case uses a UCN.
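
For instance (illustrative):

/* Both directives define the same macro, spelt once with the extended
   character and once with the equivalent UCN.  The replacement lists
   are identical, so this is a valid redefinition; -E -dD therefore
   cannot preserve "the" original spelling of the name, and printing
   the UCN form is the right output.  */
#define Á 1
#define \u00c1 1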

> 8. There are tests (e.g. gcc/testsuite/gcc.dg/ucnid-10.c) which verify 
> that when the locale is not utf8, diagnostics use UCNs instead of raw 
> UTF8. I am not sure if this still makes sense when the files themselves 
> contain UTF8, but that was the behavior that came out so I maintained 
> these tests as well.

Yes, I think that's correct.
